with aspects of statistics. The first provides knowledge of the
effect of imperfect correlation and random error on differences
between means, and the reasons for the necessity of random allocation
of objects to experimental and control conditions in scientific
experimentation. The second unit shows how to: 1) use frequency
distributions and histograms to summarize data; 2) calculate means,
medians, and modes as measures of central location; 3) decide which
measures of central location may be most appropriate in a given
instance; and 4) calculate and interpret percentiles. The third
module is designed to enable the student to: 1) discuss how
approximation is pervasive in statistics; 2) compare "structural" and
"mathematical" approximations to probability models; 3) describe and
recognize a hypergeometric probability distribution and an experiment
in which it holds; 4) recognize when hypergeometric probabilities can
be approximated adequately by binomial, normal, or Poisson
probabilities; 5) recognize when binomial probabilities can be
approximated adequately by normal or Poisson probabilities; 6)
recognize when the normal approximation to binomial probabilities
requires the continuity correction to be adequate; and 7) calculate
with a calculator or computer hypergeometric or binomial
probabilities exactly or approximately. Exercises and tests, with
answers, are provided in all three units. (MP)
ABSTRACT

Regression toward the mean is a phenomenon that is a
natural by-product of less than perfect correlation
between two variables, but regression effects have often
been mistaken for treatment effects in poorly designed
experiments. The purpose of this module is to explain,
theoretically and empirically, this bothersome concept.

1. INTRODUCTION

Did you ever notice that the sons of very tall men are
usually also tall but not quite as tall as their fathers?
And that the sons of very short fathers tend to be not as
short as their fathers? The famous anthropologist Francis
Galton did, and he once believed that this would ultimately
lead to the elimination of the very tall and the very
short. Will it?

Probably not. As we shall see, this kind of "regression"
is a statistical artifact of the imperfect correlation
between any two variables (e.g., height of father and
height of son). Unfortunately the lack of understanding of
the principle continues to be a problem in scientific
research.

2. WHAT IS REGRESSION TOWARD THE MEAN?

2.1 Definition

Regression toward the mean is the phenomenon whereby a
high (low) set of observations on one variable is associated
with a mean on another variable that is also high (low)
but that is closer to the overall mean for that other
variable. It is of no real scientific importance whatsoever;
it is a necessary consequence of less than perfect
correlation between two variables.

2.2 A Numerical and Graphical Illustration

Consider the scatterplot in Figure 1 for two variables
X and Y that are on the same scale (the Pearson product-
moment correlation coefficient for those data is 0.5), and
pay special attention to the left-most array of four points
(for X=1). The overall mean for variable X is 4, so those
four observations are low relative to that mean. Note,
however, that the mean for variable Y for those same
observations is 2.5, which is closer to the overall mean for
variable Y (also 4) than the 1 is to the mean of 4 for
variable X. The reason for this is simply the shape of the
scatterplot. Since there is not a perfect linear relationship
between the two variables, the most extreme observations
on X are not necessarily associated with the most
extreme observations on Y. When the very lowest X measures
are considered, the corresponding measures for Y have
nowhere to go but up, so to speak.

This phenomenon also operates from the top down, as
well as from the bottom up. Again referring to Figure 1,
the right-most array of four points (for X=7) produces a
mean for variable Y of 5.5, which is closer to the overall
Y-mean of 4 than 7 is to the overall X-mean of 4.

For simplicity of illustration, the Y measures of
Figure 1 were put on the same scale as the X measures.
That is not necessary, however. The general shape of the
scatterplot remains the same if either X or Y is
transformed linearly.
[Figure 1: scatterplot of Y against X with the Y on X
regression line, Y = 0.5X + 2.]

Figure 1. An illustration of regression toward the
mean. (Adapted from Campbell, D.T. and Stanley, J.C.,
Experimental and quasi-experimental designs for research,
Rand McNally, 1966, page 10. The numbers next
to some of the points are the frequencies of those
observations. The points without numbers represent
single observations. The total number of observations
is 58.)



2.3 Mathematical Explanation

A single illustration is not a sufficient explanation
of a phenomenon. The following algebraic argument treats
the general case.

Consider the equation of the regression line for Y on
X, namely

(1)    $Y = bX + a$,

where

(2)    $b = r_{xy} \frac{S_y}{S_x}$

and

(3)    $a = M_y - b M_x$.

(In these equations, $M_x$ and $M_y$ are the overall means,
$S_x$ and $S_y$ are the overall standard deviations, and
$r_{xy}$ is the correlation between the two variables.)
Substituting the values given by Eqs. (2) and (3) for b and
a into Eq. (1), we have

(4)    $Y = r_{xy} \frac{S_y}{S_x} X + M_y - r_{xy} \frac{S_y}{S_x} M_x$.

Rearranging Eq. (4) algebraically leads to

       $Y = r_{xy} \frac{S_y}{S_x} (X - M_x) + M_y$,

or

       $Y - M_y = r_{xy} \frac{S_y}{S_x} (X - M_x)$,

or

(5)    $\frac{Y - M_y}{S_y} = r_{xy} \frac{X - M_x}{S_x}$.

This is the so-called "standardized" form of the regression
equation.

Now consider a set of observations for which X is k
standard deviations from $M_x$. Then

       $\frac{Y - M_y}{S_y} = r_{xy} \frac{(M_x + k S_x) - M_x}{S_x} = r_{xy} k$.

Since $|r_{xy}| \le 1$, the value of Y on the regression line
that "goes with" this extreme value of X (the Y-mean for the
array) must be less than or equal to k standard deviations
from $M_y$ (equality holds only if $r_{xy} = \pm 1$). That's
regression toward the mean, no matter what the values of k,
$r_{xy}$, $M_x$, $M_y$, $S_x$, and $S_y$ are.
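As a numerical check on Eq. (5), the short Python sketch below computes the Y-mean on the regression line for the two extreme arrays of Figure 1. The function name is hypothetical, and the common standard deviation of 2 is an assumption made only for illustration: the figure implies that the two standard deviations are equal (the slope equals the correlation), and any equal value gives the same answer.

```python
def conditional_mean_y(x, r_xy, m_x, m_y, s_x, s_y):
    """Y-mean on the regression line for a given X, using Eq. (5)."""
    k = (x - m_x) / s_x            # X expressed in standard deviations from M_x
    return m_y + r_xy * k * s_y    # Eq. (5): the Y-mean lies r_xy * k SDs from M_y

# Figure 1: r_xy = 0.5 and both overall means are 4.
# s_x = s_y = 2 is an assumed value; only their equality matters here.
low = conditional_mean_y(1, 0.5, 4, 4, 2, 2)
high = conditional_mean_y(7, 0.5, 4, 4, 2, 2)
print(low, high)
```

Both results lie closer to the Y-mean of 4 than the corresponding X values (1 and 7) lie to the X-mean of 4, which is the regression effect described in Section 2.2.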



3. SOME OTHER EXAMPLES

3.1 Reading Improvement
An educator gives a reading achievement test to a
group of third-grade pupils, picks out the pupils who
obtained the lowest scores on the test, gives them a
two-month remedial reading program, tests them again, and
observes that their scores are significantly higher. Is this
evidence that the program has been successful? Not
necessarily. It could be regression toward the mean; scores on
the two tests probably do not correlate perfectly with one
another.

3.2 Smoking and Lung Cancer

A physician examines several cancer patients, obtains
a medical history of their cigarette smoking behavior, and
discovers that those who smoked the most had only slightly
more than an average amount of lung cancer. Does this mean
that if you're going to smoke cigarettes you might as well
smoke a lot? Perhaps; but there may be regression toward
the mean here, too. Although there is a positive correlation
between number of cigarettes smoked and amount of lung
cancer, the correlation is far from perfect.

4. BUT WHAT IS IT THAT REGRESSES
TOWARD WHICH MEAN?

This question can be best answered in the context of
two technical, but simple, statistical concepts, namely
expectation and conditionality. The expected value of a
variable, say Y, is the mean value of that variable,
usually written as E(Y). The conditional expected value of Y is
the mean value given some constraint, say X, and is
usually written as E(Y|X).

Regression toward the mean is concerned with the
comparison between the quantities X - E(X) and E(Y|X) - E(Y).
Referring to Figure 1 again, the (standardized) distance
between any X and the mean of X is always greater than or
equal to the distance between the mean of Y for that X and
the overall Y mean. So it is E(Y|X) that regresses toward
E(Y), relative to the discrepancy between X and E(X). If
the correlation between X and Y is 0, i.e., if the
scatterplot forms a "buckshot" pattern, the regression is maximal
and E(Y|X) = E(Y). If the correlation is +1 or -1 there is
no regression toward the mean, since the (standardized)
distance between E(Y|X) and E(Y) is the same as the
(standardized) distance between X and E(X).
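The limiting cases just described can be tabulated directly. The sketch below assumes the linear regression setting of Section 2, where the standardized distance of E(Y|X) from E(Y) is the correlation times the standardized distance of X from E(X); the function name and the choice of k = 2 are illustrative only.

```python
def standardized_distance_of_conditional_mean(r_xy, k):
    """Standardized distance of E(Y|X) from E(Y) when X is k SDs from E(X)."""
    return r_xy * k

# k = 2 is an arbitrary illustrative choice.
for r in (0.0, 0.5, 1.0, -1.0):
    print(r, standardized_distance_of_conditional_mean(r, 2.0))
```

With a correlation of 0 the conditional mean sits exactly at E(Y); with a correlation of +1 or -1 it is as far from E(Y) as X is from E(X), in opposite directions for the two signs.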

5. AN EMPIRICAL DEMONSTRATION OF THE PHENOMENON

Take two decks of ordinary playing cards. Select the
sevens, eights, and nines from one deck and call this
reduced deck of 12 cards Deck A. Select the aces (ones)
through nines from the other full deck and call this
reduced deck of 36 cards Deck B. Pencil in the number -2 on
each of the aces in Deck B; the number -1 on each of the
twos and threes; the number 0 on each of the fours, fives,
and sixes; the number +1 on each of the sevens and eights;
and the number +2 on each of the nines (all in Deck B).

For each card in Deck A draw a card at random (with
replacement) from Deck B. ("With replacement" means that
you put the card back in the deck before you shuffle and
draw another one.) Add the 12 pairs of numbers (the actual
denomination for the card in Deck A and the number -2, -1,
0, +1, or +2 drawn from Deck B). For example, paired with
the seven of spades in Deck A you might have a -1 from Deck
B. Adding these together you have 7 + (-1) = 6.

Now pick out the six largest sums (using any convenient
randomizing procedure to resolve ties) and find their
mean. (See Table 1 for an example of this step and all
subsequent steps in the demonstration.) Set aside the six
cards from Deck A that did not contribute to the largest
sums. They will no longer be needed.

For the same six cards from Deck A that did contribute
to the six largest sums, repeat the pairing, summing, and
averaging process using six cards drawn at random from Deck
B. Compare the two means. The second one should be lower.
Do you know why? (Try to think of a reason before you read
on.)
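For readers without two decks at hand, the demonstration can be imitated on a computer. The following Python sketch is one possible simulation, not part of the original module: the function names, the number of repetitions (2000), and the random seed are all assumptions, and ties are resolved by attaching a random number to each sum.

```python
import random

DECK_A = [7, 8, 9] * 4                                      # 12 cards
DECK_B = [-2] * 4 + [-1] * 8 + [0] * 12 + [1] * 8 + [2] * 4  # 36 pencilled numbers

def one_demonstration(rng):
    """Run the two 'testings' once; return (first mean, second mean)."""
    # Each entry is (sum, random tie-breaker, Deck A card).
    first = [(a + rng.choice(DECK_B), rng.random(), a) for a in DECK_A]
    top_six = sorted(first, reverse=True)[:6]                # six largest sums
    first_mean = sum(s for s, _, _ in top_six) / 6
    # Same six Deck A cards, fresh draws (with replacement) from Deck B.
    second_mean = sum(a + rng.choice(DECK_B) for _, _, a in top_six) / 6
    return first_mean, second_mean

def average_over_trials(trials=2000, seed=0):
    """Average the two means over many repetitions of the demonstration."""
    rng = random.Random(seed)
    results = [one_demonstration(rng) for _ in range(trials)]
    avg_first = sum(f for f, _ in results) / trials
    avg_second = sum(s for _, s in results) / trials
    return avg_first, avg_second

avg_first, avg_second = average_over_trials()
print(avg_first, avg_second)
```

A single run of the demonstration can occasionally go the other way, which is why the sketch averages over many repetitions; on average the second mean falls below the first.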

TABLE 1

One Set of Empirical Results
(regression toward the mean)

[Table not recoverable from the source. Its columns were
Deck A cards, Deck B cards, and Sums for the "first testing"
and Deck A cards, Deck B cards, and Sums for the "second
testing".]
The sevens, eights, and nines originally chosen from
the first full deck of cards are analogous to scores on a
test that the 12 brightest of 36 students deserve to get.
(The other 24 deserve to get one through six. Forget about
the tens, jacks, queens, and kings.) The 12 sums are
scores that they actually do get, scores that contain a
random error component. (They all deserve high scores, but
by chance some will "have a bad day" and obtain scores that
are less than the ones they deserve, while others will
"have a good day" and obtain scores that are greater than
the ones they deserve.)

At the second "testing" the scores obtained by the
"people" who had the six highest scores the first time
would not be expected to correlate perfectly (because of
the chance error components) with the first atypically high
scores. Ergo, regression (downward) to the mean.

The moral to all of this is: if a group of people
score very high on a test one time and get lower scores the
next time, don't be surprised and don't get too concerned.
The same implication holds at the low end of the scale: if
a group of people score very low on a test one time and get
higher scores the next time, don't get too elated. In both
cases it could be wholly or partially regression toward the
mean.

6. EXERCISES

1. Demonstrate for yourself that the implication just mentioned does
hold at the low end of the scale by carrying out the demonstration
described in Section 5 again. This time use the aces, twos, and
threes from the first full deck of cards as Deck A, and pick out
the six lowest sums.

2. Referring back to example 3.1, think of a reading improvement
program being given to the "people" who obtain the six lowest scores
at time 1, with the scores at time 2 as a measure of their
performance at the end of the program. Do you see now why the
"improvement" is a statistical necessity?

7. WHAT CAN BE DONE ABOUT IT?

7.1 In Experimental Research

Whenever we're seriously interested in the effectiveness
of a reading improvement program, a weight reduction
plan, a headache remedy, etc., we should use two groups of
people, randomly assigned to either receive (the
experimental group) or not receive (the control group) the
particular treatment in which we are interested. If all of
the people happen to be recruited from extremely high or
extremely low portions of some score distribution and are
given a pre-test before the experiment and a post-test
after the experiment, the regression toward the mean effect
will still take place, but it will be balanced across the
two groups. For the reading program example, if people who
get very low scores on the reading pre-test are randomly
assigned to experimental (they get the program) and control
(they don't) groups, both groups will do better on the
post-test due to regression toward the mean, but if the
program is really effective the members of the experimental
group will score that much higher.
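The balancing argument can be illustrated with a small simulation. Everything in the sketch below is an assumption made for illustration: scores are modeled as a true ability plus independent random error on each testing, the lowest fifth on the pre-test is recruited, and the program is given a hypothetical true effect of 5 points.

```python
import random

def simulate_randomized_study(n=5000, effect=5.0, seed=1):
    """Return (mean gain of treated group, mean gain of control group)."""
    rng = random.Random(seed)
    # Hypothetical model: observed score = true ability + fresh random error.
    ability = [rng.gauss(50, 10) for _ in range(n)]
    pre = [a + rng.gauss(0, 10) for a in ability]

    # Recruit only the lowest-scoring fifth on the pre-test.
    lowest = sorted(range(n), key=lambda i: pre[i])[: n // 5]
    rng.shuffle(lowest)                       # random assignment to groups
    half = len(lowest) // 2
    treated, control = lowest[:half], lowest[half:]

    def mean_gain(group, boost):
        # Post-test = ability + treatment boost + new error; gain = post - pre.
        gains = [ability[i] + boost + rng.gauss(0, 10) - pre[i] for i in group]
        return sum(gains) / len(gains)

    return mean_gain(treated, effect), mean_gain(control, 0.0)

treated_gain, control_gain = simulate_randomized_study()
print(round(treated_gain, 1), round(control_gain, 1))
```

The control group improves even though it receives nothing, which is the regression effect; the difference between the two mean gains, not either gain alone, is what estimates the effect of the program.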



7.2 In Non-Experimental Research

The only thing that can be done in non-experimental
research is to do the best we can in distinguishing between
a legitimate finding and a regression effect. For the
smoking and lung cancer example, the heights of sons vs.
heights of fathers example, and similar studies, the
extreme measures on one variable are usually associated with
less extreme measures on the other variable for purely
statistical reasons. (Selective mating has something to do
with increasing the correlation between fathers' heights
and sons' heights, but the regression effect provides a
sufficient explanation for the reduction to "mediocrity"
that Galton observed.)

Some people think that matching can take care of
problems associated with regression toward the mean but, alas,
it can't. In a well-known study by Helen Christiansen of
the effect of high school graduation on economic adjustment
during the early days of the depression, an original sample
of 2127 people was reduced to 23 matched (on six background
variables) pairs of graduates and non-graduates, with the
graduates exhibiting better adjustment than the
non-graduates. But the regression effect could very well account
for the difference since the non-graduates who had been
matched with the graduates on such things as mental ability
and neighborhood status (both of which are positively
correlated with economic adjustment) were well above average
relative to their fellow non-graduates and would be
expected to regress further (to their own population mean) than
the graduates at the follow-up testing ten years later,
thereby making the graduates appear to be better adjusted
economically.

Note that it is not feasible to study the effect of
high school graduation on economic adjustment
experimentally, since it is socially unacceptable to assign some
people to receive a high school education and to withhold
it from others. However, there are better ways than the
matched-pairs technique to control for confounding
background variables, techniques that are also less subject to
regression effects and do not result in the shrinkage of
the research sample.

One final point: the regression effect works
"backwards" as well as "forwards" statistically, even though it
makes absolutely no sense scientifically. Very tall sons
have fathers who are closer to average height than they
are, which should convince you, if this module and your
previous exposure to statistics have not already done so,
that correlation per se does not necessarily imply
causation.

8. REFERENCES

9. END-OF-MODULE QUIZ

1. If a group of people who exhibited great test anxiety before
counseling had greater test anxiety after counseling, is
regression toward the mean a likely explanation? Why or why not?

2. If the regression equation for Y on X is Y = 0.75X + 1.5,
$M_x = M_y = 6$, and $S_x = S_y = 2$, what is the mean on variable Y for
ten observations for which X = 5? Does that make sense? Why or
why not?

3. (Bonus question) In some experiments the people in the
experimental group and the people in the control group are the same
people, i.e., they receive both treatments. Is regression toward
the mean a problem in such experiments? Why or why not?
10.1 Answers to Exercises

1. It worked fine for me. The six lowest sums that I got were four
1's and two 2's, with a mean of 1.33. The corresponding sums the
next time were 0, 1, 1, 3, 4, and 5, with a mean of 2.33, which is
a point higher (and closer to the overall mean) than the first
one.

2. It is artifactual because the six lowest "people" had bad luck the
first time, and since luck plays no favorites they couldn't all
have bad luck the second time; therefore, as a group they scored
higher and would have done so with or without the program.

10.2 Answers to the Quiz

1. No, regression toward the mean is not a likely explanation, since
they scored high the first time and higher, not lower, the second
time. The regression effect is only relevant for high to lower
and low to higher mean differences, i.e., an originally high group
scores lower the second time or an originally low group scores
higher the second time.

The evidence suggests that the program was not only not
effective, but harmful. However, since there was no control group
(which would be treated in the same way as the experimental group
except that they don't get the counseling) we cannot be sure that
the counseling itself was ineffective. The disappointing results
may be due to the counselor, the office in which the counseling
took place, some other event that transpired during the counseling
period, etc.

2. Substituting X = 5 in the regression equation, we obtain Y = 5.25.
The 5.25 is closer to the mean of Y than the 5 is to the mean of
X, so it indeed does make sense. X = 5 is not an extreme
observation (it is only one-half of a standard deviation below the mean
of X), but the regression effect actually works on all of the
observations, not just the extreme ones, as Eq. (5) attests.
The correlation coefficient for these data, by the way, is
the same as the regression slope, b, i.e., 0.75, since

       $b = r_{xy} \frac{S_y}{S_x}$

and

       $S_x = S_y$.

3. Yes, since pre-test and post-test scores still won't correlate
perfectly. Things get a little more complicated, however, since
you could have three or four, rather than two, testings to contend
with: pre-testing before Treatment A, post-testing after
Treatment A, pre-testing before Treatment B (which may be the same
testing as the post-testing after Treatment A), and post-testing
after Treatment B. The post-A scores should be closer to the mean
than the pre-A scores, due to the regression effect, but since the
experience of Treatment B is often not contemporaneous with the
experience of Treatment A (the people usually can't be undergoing
both treatments at once), the regression from pre-B to post-B may
not be comparable.
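The arithmetic in the answer to Question 2 can be checked in a few lines. This is only a verification sketch; the variable names are hypothetical, and the correlation is recovered by solving Eq. (2) for it.

```python
b, a = 0.75, 1.5            # slope and intercept given in Question 2
s_x = s_y = 2.0             # standard deviations given in Question 2

y_mean_at_5 = b * 5 + a     # Eq. (1) evaluated at X = 5
r_xy = b * s_x / s_y        # Eq. (2) solved for the correlation
print(y_mean_at_5, r_xy)
```

The Y-mean of 5.25 is 0.75 below the overall mean of 6, while X = 5 is a full point below its mean of 6, so the regression effect appears even for this unremarkable value of X.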



BASIC DESCRIPTIVE STATISTICS

1. THE NEED TO SUMMARIZE DATA - AN EXAMPLE

There is a quantitative side to almost every
academic field. The geologist measures the hardness of
various rock specimens. The psychologist measures
reaction times to a certain stimulus. The educator measures
learning as it is reflected in scores on achievement
tests. The economist records income. The list could
be extended for many pages.

After a set of data has been collected the next
task is to decide how to best present it so that it is
available to others in a quick and useful way. The
methods used to do this belong to a branch of study
called descriptive statistics. Included in descriptive
statistics are the methods of collection, organization
and description of numerical information. The topics
covered in this module are all from the field of
descriptive statistics.

Suppose we have collected the data below.

HEIGHTS OF ONE HUNDRED EIGHTY
17-YEAR-OLD FEMALES IN CENTIMETERS

[Raw data table; the values that survive in this copy are
168, 158, 161, 159, 163, 163, 170.]

Data in this form are called raw data. In this
unorganized form the data can only be understood after a
certain amount of time-consuming examination. If the
data set included several thousand numbers the need to
organize and summarize would be even greater.

2. METHODS OF SUMMARIZING DATA

In this section we will discuss two important
methods of summarizing data: the frequency distribution
and the histogram.

2.1 Frequency Distribution

The simplest way to organize data is by means of a
frequency distribution with one value in each class.
Such a distribution consists of a list of the values
which appear in the data set, arranged in increasing
order, and the frequencies which indicate the number of
times the various values appear. Such a frequency
distribution for the height data of Section 1 appears below.



HEIGHTS OF 17-YEAR-OLD FEMALES

HEIGHT (in cm)   TALLY                      FREQUENCY

145              ||                          2
146              |                           1
147              |||                         3
148              |||                         3
149              ||||                        4
150              ||                          2
151              ||                          2
152              ||||                        4
153              ||||                        4
154              ||||| |||                   8
155              ||||                        4
156              ||||| |                     6
157              ||||| ||||| |||            13
158              ||||| ||||| ||             12
159              ||||| ||||                  9
160              ||||| ||||| ||||| ||||     19
161              ||||| |||||                10
162              ||||| |||||                10
163              ||||| ||                    7
164              ||||| ||||                  9
165              ||||| |                     6
166              ||||| ||                    7
167              ||||| ||                    7
168              ||||| |||                   8
169              ||                          2
170              ||||| |                     6
171                                          [illegible]
172              ||||                        4
173              |                           1
174              |||                         3
175                                          0
176              ||                          2
177              ||                          2
178                                          0
179                                          0
180              |                           1

                 TOTAL = 180

♦ 

The tallies in the middle column above are included
only as an indication of how the frequency distribution
was obtained. It is not necessary, or even desirable,
to include these tallies with a frequency distribution.

Already we have made significant progress in the
process of summarizing the data. This frequency
distribution allows us to "get a feeling" for the data
much more quickly than was possible from the raw data.
Furthermore, nothing has been lost. All of the
information which was available from the raw data is
available in this frequency distribution. This summary is,
however, less than perfect. There are 37 different
classes; it takes nearly a full page to present this
frequency distribution; and even with the data in this
form it takes some time to digest it.
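Building a frequency distribution with one value in each class is a simple counting task. The Python sketch below illustrates it on seven heights taken from the raw data; this short list is only a sample chosen for illustration, not the full set of 180 observations.

```python
from collections import Counter

# A small illustrative sample of heights (in cm), not the full data set.
heights = [168, 158, 161, 159, 163, 163, 170]

freq = Counter(heights)          # maps each value to the number of times it occurs
for value in sorted(freq):       # list the values in increasing order
    print(value, freq[value])
print("TOTAL =", sum(freq.values()))
```

No information is lost in this step: repeating each value as many times as its frequency indicates recovers the original numbers, apart from their order.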

The situation might have been worse. Each height
in this data set has apparently been rounded to the
nearest centimeter. If, instead, each height were
rounded to the nearest tenth of a centimeter then there
would have been many more classes and each class would
have a very small frequency. In such a case the
frequency distribution would represent only a small
improvement over the raw data because it contains too
much detailed information; there are too many different
values.

In other cases it may happen that a frequency
distribution of the type just given is a very effective
summary. For example, the frequency distribution shown
below gives a quick and accurate description of the
number of games played in the World Series of Baseball.

NUMBER OF GAMES IN THE WORLD SERIES (1923-1978)

No. of Games     Frequency

4                11
5                10
6                11
7                24

TOTAL = 56

Let us return to the set of data representing
heights. We can condense the ungrouped frequency distribution
given above by using intervals as our classes, rather than
individual values. For example:

HEIGHTS OF 17-YEAR-OLD FEMALES

HEIGHT (in cm)       FREQUENCY

144.5 - 150.5        15
150.5 - 156.5        28
156.5 - 162.5        73
162.5 - 168.5        44
168.5 - 174.5        15
174.5 - 180.5         5

The first class contains all of the heights which
fall between 144.5 cm and 150.5 cm. The number 144.5
is called the lower boundary of the class and 150.5 is
called the upper boundary. Note that the upper boundary
of one class is the lower boundary of the next class.
In this example the class boundaries have been chosen
in such a way that no number from the data set is equal
to a class boundary. Thus each number can be placed in
one and only one class. By selecting class boundaries
which contain one more significant digit than the data
it is always possible to choose these boundaries so
that they are distinct from the data. This is desirable
in order to avoid ambiguity.

The midpoint, or class mark, of each class interval
may be found by adding the upper and lower class
boundaries and dividing the sum by 2. In the frequency
distribution given above the class marks are 147.5,
153.5, 159.5, 165.5, 171.5 and 177.5. The width of
each class interval is called the class width. The class
width may be found by subtracting the lower class
boundary from the upper. Each class in the example has a
class width of 6. It is desirable, but not necessary,
to have all classes of the same width.

A frequency distribution which uses class intervals
is called a grouped frequency distribution and the data
in such a frequency distribution are called grouped data.
The frequency distribution with one value in each class is
sometimes called an ungrouped frequency distribution.

The grouped frequency distribution has been obtained
at the cost of a certain loss of information. While the
frequency distribution has been obtained from the raw
data, the raw data cannot be recovered from the
frequency distribution. For example, in the frequency
distribution for heights we know that fifteen numbers
lie between 144.5 and 150.5. But that is all we can
tell. The exact values of these fifteen numbers cannot
be determined from the frequency distribution.
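Grouping can also be sketched in code. The function below is a hypothetical helper, assuming equal class widths and boundaries that never coincide with a data value (as recommended above); it returns the lower boundary, upper boundary, class mark, and frequency for each class. The seven heights are again only an illustrative sample, not the full data set.

```python
import math
from collections import Counter

def grouped_frequencies(data, first_lower, width, n_classes):
    """Return (lower, upper, class mark, frequency) for equal-width classes."""
    # The class index of x is the number of whole widths between the first
    # lower boundary and x.
    counts = Counter(math.floor((x - first_lower) / width) for x in data)
    rows = []
    for i in range(n_classes):
        lower = first_lower + i * width
        upper = lower + width
        mark = (lower + upper) / 2          # midpoint of the class
        rows.append((lower, upper, mark, counts.get(i, 0)))
    return rows

heights = [168, 158, 161, 159, 163, 163, 170]    # illustrative sample only
rows = grouped_frequencies(heights, 144.5, 6, 6)
for lower, upper, mark, frequency in rows:
    print(lower, upper, mark, frequency)
```

The class marks produced here agree with those listed in the text. The individual heights cannot be read back from the output, which is the loss of information just described.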

Exercise 1. Forty students in a chemistry course did a laboratory
experiment to determine the pH of a solution. The results are
recorded below.

8.00    8.15    8.10    8.15    8.05
8.20    8.00    7.95    8.05    8.15
8.05    8.10    8.10    8.15    8.25
8.20    8.10    8.30    8.15    8.20
8.05    8.15    8.00    8.20    8.10
8.25    8.30    8.15    8.20    8.10
8.05    8.25    8.05    8.15    8.00
8.10    8.05    8.15    8.25    8.05

a. Construct a frequency distribution for these data in which
each class consists of a single value.

b. Construct a grouped frequency distribution for these data in
which the boundaries of the first class are 7.895 and 7.995.
Use classes of equal width.

Exercise 2. Thirty laboratory rats are run through a maze. The
time required to complete the maze on the first run is recorded
below for each rat. The times are in seconds.

[Data table not recoverable from the source.]

Construct a frequency distribution for these data.



2.2 Histograms

A picture is worth a thousand words. If this is so
then it makes sense to find a pictorial method of
presenting data. The histogram is such a method. The
histogram below is based on the frequency distribution
for height data on page 4.

Figure 1. Histogram of height data. (Title: HEIGHTS OF
17 YEAR-OLD FEMALES; horizontal axis: HEIGHT (in cm).)

On the horizontal axis in Figure 1 we see the
class boundaries from the frequency distribution on
page 4. On the vertical axis we see class frequencies.
The areas of the rectangles in the histogram must be
proportional to the frequencies of the classes which
they represent. If, as in our example, all classes
have the same class width then the area of each
rectangle is proportional to its height. In this case
the height of each rectangle may be thought of as
representing the frequency of the corresponding class.
The use of a vertical axis for frequencies is, in this
case, desirable and recommended. However, should the
frequency distribution contain classes of varying
widths then a vertical axis for frequencies is
impossible and must be avoided. (See the solution to
Exercise 4, below, for an example of a histogram with
unequal class widths.)
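The area rule for unequal class widths can be made concrete with a small computation. The sketch below is an illustration only, not part of the module's own method: it assumes the rectangle height is taken as frequency divided by class width, which is one way to make each area proportional to its frequency. The classes are those shown in the answer to Exercise 2, and the function name is hypothetical.

```python
# (lower boundary, upper boundary, frequency) for classes of unequal width,
# taken from the answer to Exercise 2 (maze times in seconds).
classes = [
    (9.95, 14.95, 11),
    (14.95, 19.95, 10),
    (19.95, 24.95, 3),
    (24.95, 29.95, 1),
    (29.95, 39.95, 3),
    (39.95, 49.95, 1),
    (49.95, 59.95, 1),
]

def rectangle_heights(classes):
    """Heights chosen so that width * height equals the class frequency."""
    return [freq / (upper - lower) for lower, upper, freq in classes]

heights = rectangle_heights(classes)

# Recompute the areas to confirm that each equals its class frequency.
areas = [h * (upper - lower) for h, (lower, upper, _) in zip(heights, classes)]

for (lower, upper, freq), h in zip(classes, heights):
    print(f"{lower}-{upper}: frequency {freq}, height {h:.2f}")
```

Note that the fifth class has the same frequency as the third but only half its height, because it is twice as wide.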

Exercise 3. Draw a histogram for the frequency distribution in
Exercise 1, part b, on page 6.

Exercise 4. Draw a histogram for the frequency distribution in
Exercise 2 on page 6.



3. MEASURES OF LOCATION - ANOTHER METHOD OF
SUMMARIZING DATA

In many cases an even more drastic summary of the
data is required. For example, we might seek a single
number that can be thought of as representative of the
entire set of data. Such numbers are called averages,
or measures of location, or measures of central
tendency, or measures of position. We shall call them
measures of location. This conveys the important idea
that such measures tell us where the data are, or,
equivalently, how large the data are. At the same time
it avoids the word "average," to which some people are
prone to give improper interpretations.

There are many measures of location. In this
section we will discuss three of the most useful: the
mean, the median and the mode. Each of these may be
thought of as, in some sense, locating the center of the
data.

3.1 The Arithmetic Mean

The most common measure of location, the one most
people are thinking of when they say "the average of
these numbers is such-and-such," is the arithmetic mean.
Although there are other means than the arithmetic mean
(for example, the geometric mean or the harmonic mean),
when the word mean is used alone it is safe to assume
that the arithmetic mean is the mean to which we are
referring.

3.1.1 Computing the Mean from Raw Data

The arithmetic mean is the number obtained by
adding all of the numbers together and dividing this
sum by the number of numbers. For example, the mean
of 6, 11, 7 and 5 is (6 + 11 + 7 + 5)/4 = 29/4 = 7.25.

If the variable x is used to represent the individual
numbers in the data set, then $\bar{x}$ is used as a symbol
for the mean. If the variable y were used to represent
the individual numbers then $\bar{y}$ would be the mean of
these numbers, and similarly for other variable names.

Let us use n to represent the number of numbers in
a set of data. If we use x to represent the individual
numbers then $\sum x$ will be used to represent the sum of
the numbers. Then we have the following formula for
the mean:

$$\bar{x} = \frac{\sum x}{n}$$

For example, if the data set consists of the
numbers 6.2, 5.8, 2.9, 3.3 and 4.1 then n = 5, $\sum x$ = 22.3
and

$$\bar{x} = \frac{22.3}{5} = 4.46.$$

For the data on page 1, n = 180, $\sum x$ = 28900 and

$$\bar{x} = \frac{28900}{180} \approx 160.6.$$

The symbol "$\approx$" indicates approximate equality and is
used here to indicate that the final answer has been
rounded.
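The formula for the mean translates directly into a few lines of code. The following is a minimal sketch in Python; the function name `mean` is chosen for this illustration, and the data are the two small examples worked above.

```python
def mean(values):
    """Arithmetic mean: the sum of the numbers divided by the number of numbers."""
    n = len(values)
    return sum(values) / n

data = [6.2, 5.8, 2.9, 3.3, 4.1]
x_bar = mean(data)

print(round(x_bar, 2))        # mean of the five-number example
print(mean([6, 11, 7, 5]))    # mean of the four-number example
```

Rounding is applied only when printing, in the same spirit as the rounded answer 160.6 given above for the height data.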

3.1.2 Computing the Mean from a Frequency Distribution 

Sometimes the data are available to us only in the
form of a frequency distribution. Thus it is necessary
for us to have a method for calculating the mean from a
frequency distribution. If the frequency distribution
has only one value in each class, we use the following
method:

a. Multiply each value by the corresponding
   frequency and add the products.

b. Add the frequencies to obtain n.

c. Divide the first number by the second.

This method is illustrated below using the World
Series data from page 4.

NUMBER OF GAMES    FREQUENCY
       x               f          x · f
       4              11            44
       5              10            50
       6              11            66
       7              24           168

              $\sum f$ = 56    $\sum (x \cdot f)$ = 328

$$\bar{x} = \frac{328}{56} \approx 5.9$$

This method can be expressed as a formula:

$$\bar{x} = \frac{\sum (x \cdot f)}{n}$$
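The three steps (multiply and add, add the frequencies, divide) can be sketched in code as follows. This is an illustrative Python version only; the function and variable names are assumptions, and the data are the World Series frequencies tabulated above.

```python
def mean_from_frequencies(values, frequencies):
    """Mean of an ungrouped frequency distribution."""
    # Step a: multiply each value by its frequency and add the products.
    total = sum(x * f for x, f in zip(values, frequencies))
    # Step b: add the frequencies to obtain n.
    n = sum(frequencies)
    # Step c: divide the first number by the second.
    return total / n

games = [4, 5, 6, 7]
frequency = [11, 10, 11, 24]

x_bar = mean_from_frequencies(games, frequency)
print(round(x_bar, 1))
```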

If the classes in the frequency distribution are
intervals rather than individual values it is not
possible to compute the mean exactly. This is because
we cannot determine the exact value of each piece of
data. It is, however, possible to make a very good
approximation of the mean.

The sum of the numbers in each interval can be
found approximately by multiplying the class frequency
by the class midpoint. Thus the mean may be
approximated by using the same formula as before:

$$\bar{x} \approx \frac{\sum (x \cdot f)}{n}$$

But now the x on the right hand side represents the
midpoint of the class. The next example illustrates the
use of this formula for the height data from the
frequency distribution on page 4.



HEIGHT           FREQUENCY    CLASS MARK
(in cm)              f             x            x · f
144.5—150.5         15          147.5          2212.5
150.5—156.5         28          153.5          4298.0
156.5—162.5         73          159.5         11643.5
162.5—168.5         44          165.5          7282.0
168.5—174.5         15          171.5          2572.5
174.5—180.5          5          177.5           887.5

            $\sum f$ = 180            $\sum (x \cdot f)$ = 28896.0

$$\bar{x} \approx \frac{28896}{180} \approx 160.5.$$


How does this answer compare with the value of $\bar{x}$
obtained from the raw data? Can you account for the
difference?
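The grouped-data approximation can be sketched in the same way. In the Python illustration below each class is written as a (lower boundary, upper boundary, frequency) triple; this representation and the function name are assumptions made for the example, not notation from the module.

```python
def grouped_mean(classes):
    """Approximate the mean, treating every number as if it sat at its class mark."""
    n = sum(f for _, _, f in classes)
    total = sum(((lower + upper) / 2) * f for lower, upper, f in classes)
    return total / n

# Height data (in cm) from the grouped frequency distribution above.
heights = [
    (144.5, 150.5, 15),
    (150.5, 156.5, 28),
    (156.5, 162.5, 73),
    (162.5, 168.5, 44),
    (168.5, 174.5, 15),
    (174.5, 180.5, 5),
]

approx_mean = grouped_mean(heights)
print(round(approx_mean, 1))
```

The result differs slightly from the raw-data mean because the class mark stands in for the unknown individual values in each class.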

3.1.3 Properties of the Mean

The advantages of the mean as a measure of
location include:

a. It is the most commonly used measure of
   location and thus is familiar to many people.

b. It is relatively easy to compute.

c. It lends itself to algebraic manipulation.

d. Each number in the data set has an effect on
   the mean.

e. The mean is the most stable measure of
   location under repeated sampling.

The last statement above requires some explanation.
As we become more knowledgeable about statistics we find
that the data which we have in hand, called a sample, are
often just a fraction of some larger set of data called
a population. It is of central importance to use the
data in the sample to draw inferences about the
population. The study of how this is done is called
inferential statistics. One of the reasons that the mean is
often used in drawing inferences is that the
variability of the mean among several samples is less than
the variability of other measures of location. This is
what we mean when we say "the mean is the most stable
measure of location under repeated sampling."

The chief disadvantage of the mean as a measure of
location is that it is unduly affected by extreme
values. For example, the mean of 6, 7, 500 and 3 is
129, which does not seem representative of the original
numbers.
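The effect of a single extreme value is easy to see numerically. The sketch below uses Python's standard `statistics` module on the four numbers above; the comparison with the median (introduced in Section 3.2) and with the data set after dropping 500 is added here purely as an illustration.

```python
from statistics import mean, median

data = [6, 7, 500, 3]
without_extreme = [6, 7, 3]      # the same data with the extreme value removed

print(mean(data))                # pulled far upward by the value 500
print(median(data))              # stays among the typical values
print(round(mean(without_extreme), 2))
```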



Exercise 5. Compute the mean of the data given in Exercise 1 on
page 5.

Exercise 6. Compute the mean of the data given in Exercise 1 on
page 5 from the ungrouped frequency distribution obtained in part
a of that exercise.

Exercise 7. Approximate the mean of the data given in Exercise
1 on page 5 from the grouped frequency distribution obtained in
part b of that exercise.

Exercise 8. Compare the results of Exercises 5, 6, and 7.

Exercise 9. Compute the mean of the data given in Exercise 2 on
page 6.

Exercise 10. Approximate the mean of the data in Exercise 2 on
page 6 from the frequency distribution obtained in that exercise.



3.2 The Median 



For a given set of data, a number which is greater
than half of the data and less than the other half
would be a useful measure of location. In practice
there may be no such number. For example, if the
numbers in the data set are 3, 4 and 5 then the number
in the middle is 4. But only one-third of the data are
smaller than 4. In order to ensure that the measure
we are defining will always exist we must make a
slightly more elaborate definition.

The median of a set of data is a number which:
a) is not greater than more than half of the data,
and b) is not less than more than half of the data.
If the variable x is used to represent the individual
numbers in the data set then $\tilde{x}$ will be used to
represent the median.

3.2.1 Computing the Median from Raw Data 

To calculate the median it is first necessary to
rank the data from smallest to largest. The median is
then the "number in the middle."

If n, the number of numbers, is odd then this
number in the middle is easy to find. For example, to
find the median of 11, 17, 12, 23 and 13 we rank the
data (11, 12, 13, 17, 23) and observe that the number
in the middle is 13. This is the median.

If n is even then there is a small problem. If,
for example, the ranked data are 7, 9, 10, 15, 18 and
20 then any number between 10 and 15 satisfies the
definition of the median. To be technically correct
we should speak of a median rather than the median.
But this ambiguity is avoided if we define the median
in this case to be the mean of the two numbers in the
middle of the ranked data. By this agreement the
median of 7, 9, 10, 15, 18 and 20 is

$$\tilde{x} = \frac{10 + 15}{2} = 12.5.$$

In both examples above, no matter whether n is
even or odd, the median is the number in the $\frac{1}{2}(n+1)$
position in the ranked data. When n was 5, $\frac{1}{2}(n+1)$ was
3 and the median was the third number in the ranked
data. When n was 6, $\frac{1}{2}(n+1)$ was 3½ and the median was
halfway between the third and fourth numbers in the
ranked data. Thus the procedure for finding the median
from raw data may be summarized as follows:

a. Rank the data.

b. Find the number in the $\frac{1}{2}(n+1)$ position in
   the ranked data.
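The two-step procedure can be sketched in code. The Python function below is an illustration with a name chosen for this example; it follows the position rule literally, including the halfway case when n is even.

```python
def median(values):
    """Median by the (n + 1)/2 position rule."""
    ranked = sorted(values)              # step a: rank the data
    n = len(ranked)
    position = (n + 1) / 2               # step b: 1-based position of the median
    whole = int(position)
    if position == whole:                # n odd: one number in the middle
        return ranked[whole - 1]
    # n even: halfway between the two middle numbers
    return (ranked[whole - 1] + ranked[whole]) / 2

print(median([11, 17, 12, 23, 13]))
print(median([7, 9, 10, 15, 18, 20]))
```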

3.2.2 Computing the Median from
      a Frequency Distribution

If the data are available to us in a frequency
distribution then the data have, in effect, been
ranked. If each class in the distribution contains a
single value we need only determine the position of the
median and find the number in that position.

For example, in the distribution of height data on
page 2, n = 180. Thus the position of the median is
$\frac{1}{2}(181)$ = 90.5, or halfway between the 90th and 91st
numbers. Adding the frequencies from the first class
onward we find that 77 numbers are in the classes up to
and including 159 cm and 96 of the numbers are in the
classes up to and including 160 cm. Thus both the 90th
and 91st numbers are equal to 160 cm and the median is

$$\tilde{x} = \frac{160 + 160}{2} = 160.$$

If the classes in the frequency distribution are
intervals then, as with the mean, we cannot calculate
$\tilde{x}$ exactly, but only approximate it. The procedure used
to approximate the median is as follows:

a. Find the position of the median: $\frac{1}{2}(n+1)$.

b. Find the class which contains the median.

c. Use the formula

$$\tilde{x} = L + \left[\frac{\frac{1}{2}(n+1) - S}{f}\right] w$$

where: L = lower boundary of the class containing the
           median.

       S = sum of frequencies for classes lower than
           the class containing the median.

       f = frequency of the class containing the
           median.

       w = width of the class containing the median.

Applying this rule to the grouped data on heights
on page 4 we find:

a. Position of the median = $\frac{1}{2}(181)$ = 90.5.

b. The median is in the third class (156.5-162.5).

c. L = 156.5, S = 15 + 28 = 43, f = 73, w = 6,

$$\tilde{x} = 156.5 + \left[\frac{90.5 - 43}{73}\right] 6 = 156.5 + 3.9 = 160.4.$$

This answer compares favorably with the exact result,
160, obtained above.
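Steps a through c can be sketched as a short routine. The Python code below is illustrative only: the (lower boundary, upper boundary, frequency) representation and the function name are assumptions, and the classes must be listed from lowest to highest.

```python
def grouped_median(classes):
    """Approximate the median of grouped data by interpolating within a class."""
    n = sum(f for _, _, f in classes)
    position = (n + 1) / 2                      # step a
    below = 0                                   # S: total frequency of lower classes
    for lower, upper, f in classes:
        if below + f >= position:               # step b: class containing the median
            width = upper - lower               # w
            return lower + (position - below) / f * width   # step c
        below += f
    raise ValueError("no classes given")

heights = [
    (144.5, 150.5, 15),
    (150.5, 156.5, 28),
    (156.5, 162.5, 73),
    (162.5, 168.5, 44),
    (168.5, 174.5, 15),
    (174.5, 180.5, 5),
]

approx_median = grouped_median(heights)
print(round(approx_median, 1))
```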

3.2.3 Properties of the Median

The median has the following advantages:

a. It is an easily understood measure of
   location.

b. It is not affected by extreme values and
   thus is sometimes more typical of the
   numbers in the data set than is the mean.



~^r^T;^-^^l7/ .there'' ' -\*'~ 

■ * ^ "--^v..^ ^^*3|Yj^^ .: * 'Such ' , ~7 - 




: -^i^^i^Xul^^^^: that . ' > ^ 
* v the m&.dian<-.coui^^.^^>> / - ^ '* X ^ 

d. In many frequency distributions the top class
   is open-ended. In a distribution of the number of
   children in a family, for example, the top class might
   have no upper boundary. There is then no
   midpoint of this top class which can be used
   to approximate the mean of such a distribution.
   But since the top class is not usually
   involved in the process of finding the
   median, it may be found as before.

The chief disadvantage of the median is that it
does not lend itself to algebraic manipulation as
readily as does the mean. We might also regard the
necessity to rank the data as a disadvantage. For
large sets of data the ranking procedure is time
consuming, even if done on a computer.

Exercise 11. Compute the median of the data on World Series
games given in the frequency distribution on page 4.

Exercise 12. Compute the median of the data in Exercise 1 on
page 5. Compare $\tilde{x}$ with $\bar{x}$ for this data set.

Exercise 13. Compute the median of the data in Exercise 1 on
page 5 from the frequency distribution constructed in part b of
that exercise. Compare this with the result obtained in Exercise
12.

Exercise 14. Compute the median of the data in Exercise 2 on
page 6. Compare $\tilde{x}$ with $\bar{x}$ for this data set.

3.3 The Mode

The mode of a set of numbers is simply the number
which appears more frequently than any other. For
example, in the data set presented on page 1 (and
again on page 2) the mode is 160.

If all the numbers in the data set are distinct
then there is no mode. Even when there is a mode it
may be of little use. For example, if a data set
consists of 100 values, with two of these being equal and
the remainder distinct, it is unlikely to be of any use
to note that the value which occurs twice is the mode.

On the other hand, if the mode represents some
relatively large fraction of the data, it is useful to
report it. In the data on World Series games on page
4 we see that nearly half of the World Series have
taken seven games to complete. This is an interesting
feature of the data. Thus it makes some sense to
mention the mode. It would make less sense to
mention this if the four class frequencies had been 11,
10, 11 and 12. The importance of the mode as a measure
of location is directly related to the relative
frequency of this value: the larger the fraction of the
data represented by the mode, the more important the
mode becomes.

Sometimes a data set will have two values which
occur much more frequently than the others. For
example, the salaries of employees of a business might
fall mainly into two categories, low salaries for
laborers and higher salaries for management personnel.
Such a data set is said to have two modes, even if the
frequency for one mode is somewhat larger than for the
other. Such data may also be described as bimodal.
It is appropriate to report both modes for bimodal data.

If the data are in a grouped frequency
distribution, we may choose the class with the largest frequency
and call this the modal class. Alternatively, the
midpoint of the modal class may be reported as the mode.

In short, if one or two values, or intervals,
represent a relatively large fraction of the data then
this is interesting and should be mentioned when
describing the data. Otherwise we should not use the
mode as a measure of location.
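Because the usefulness of the mode depends on how large a fraction of the data it represents, it can help to compute both at once. The Python sketch below is an illustration only; the function name is hypothetical, and the data are rebuilt from the World Series frequencies quoted earlier (11, 10, 11 and 24 series lasting 4, 5, 6 and 7 games).

```python
from collections import Counter

def modes_with_share(values):
    """Return (list of modal values, fraction of the data taking a modal value)."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return [], 0.0                    # all values distinct: no mode
    modal_values = sorted(v for v, c in counts.items() if c == top)
    return modal_values, top / len(values)

world_series = [4] * 11 + [5] * 10 + [6] * 11 + [7] * 24

modes, share = modes_with_share(world_series)
print(modes, round(share, 2))
```

A share close to one half, as here, is the kind of "relatively large fraction" that makes the mode worth reporting.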
4. CHOOSING A MEASURE OF LOCATION



Now that we have three measures of location at
our disposal, which one should we use? The answer to
this question depends both on the data set itself and
on the use we intend to make of the measure of
location once it has been found. If our purpose is simply
to describe the data effectively we should use
whatever measure or measures are suggested by the data.

The shape of the histogram of a data set is useful
in deciding what measure to use. Four possibilities
are illustrated in Figure 2.

Figure 2. Histogram shapes: (a) symmetric,
(b) positively skewed, (c) negatively skewed, (d) bimodal.

If the histogram of the data is approximately
symmetric, as in Figure 2a, then the mean and the
median will be approximately equal. If the histogram
is approximately symmetric and has a single modal
class then the mean, median and mode are all
approximately equal. If the data are concentrated toward
the lower end of the range with a few larger values,
as in Figure 2b, then we say the data are positively
skewed. The reverse case, illustrated in Figure 2c,
is referred to as negatively skewed data. The more
the data are skewed, the greater will be the
difference between the mean and the median.

The histogram on page 6, which represents the
height data given on page 1, is approximately symmetric.
For this data set the mean was 160.6, the median was 160
and the mode was 160. The data set summarized in the
frequency distribution below is negatively skewed.

   Class          Frequency
  0.5—100.5           3
100.5—200.5           2
200.5—300.5           7
300.5—400.5          24
400.5—500.5          52

For this data set $\bar{x}$ = 387, $\tilde{x}$ = 417 and the midpoint of
the modal class is 450.5.
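The figures quoted for the negatively skewed distribution can be reproduced with the grouped-data formulas of Sections 3.1.2 and 3.2.2. The Python sketch below is an illustration; the function names and the triple representation of classes are assumptions made for this example.

```python
def grouped_mean(classes):
    """Approximate mean using class marks."""
    n = sum(f for _, _, f in classes)
    return sum((lo + hi) / 2 * f for lo, hi, f in classes) / n

def grouped_median(classes):
    """Approximate median by interpolating within the median class."""
    n = sum(f for _, _, f in classes)
    position = (n + 1) / 2
    below = 0
    for lo, hi, f in classes:
        if below + f >= position:
            return lo + (position - below) / f * (hi - lo)
        below += f
    raise ValueError("no classes given")

skewed = [
    (0.5, 100.5, 3),
    (100.5, 200.5, 2),
    (200.5, 300.5, 7),
    (300.5, 400.5, 24),
    (400.5, 500.5, 52),
]

x_bar = grouped_mean(skewed)
x_tilde = grouped_median(skewed)
print(round(x_bar), round(x_tilde))
```

The mean falls below the median because the few small values in the long lower tail pull the mean down, which is the pattern described above for negatively skewed data.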

The outstanding characteristic of the data
represented by the histogram in Figure 2d is that it is
bimodal. This fact should be included in any
description of the data.

If we intend to follow the calculation of the
measure of location with further statistical computations
then this fact must be considered when choosing the
measure of location. The great majority of
statistical tests and procedures are designed to use the mean
rather than some other measure of location. Hence
there is a strong inclination to choose the mean in
those cases where further statistical investigation is
anticipated.

With these facts in mind we list below some
suggestions.

1. In general, use the mean. It is the most
commonly used measure. It is especially appropriate
if you expect to do further statistical computations.

2. If the data are highly skewed, use the median.
The median is, in general, less affected by a small
number of very extreme values than is the mean.

3. If the data are in a frequency distribution
which uses an open-ended interval, use the median.

4. If the data have a pronounced mode, mention
this fact. If the data have two pronounced modes,
mention this also.

5. There is no law which forbids you to report
more than one measure of location.

Exercise 15. The frequency distribution below, taken from the
1978 edition of the Statistical Abstract of the United States,
gives adjusted gross incomes as reported on individual income
tax returns in 1976. Which measure of location is most
appropriate for these data, and why?

ADJUSTED GROSS INCOME      NUMBER OF TAXPAYERS
    (IN DOLLARS)              (IN THOUSANDS)

        0 to     3,000            15,015
    3,000 to     5,000             8,837
    5,000 to    10,000            19,891
   10,000 to    15,000            14,182
   15,000 to    20,000            11,182
   20,000 to    25,000             6,662
   25,000 to    30,000             3,611
   30,000 to    50,000             3,632
   50,000 to   100,000               945
  100,000 to   500,000               221
  500,000 to 1,000,000                 4
  over 1,000,000



Exercise 16. Suppose that two hundred film reviewers were asked
to choose, from among the five films listed below, their favorite.
Suppose further that the responses were as indicated. What
measure of location is most appropriate for these data and why?

PICTURE                 NUMBER

High Noon                  7
The Godfather             55
Gone With the Wind        90
The Sound of Music         8
Casablanca                40

Exercise 17. On an opinionnaire 450 people were asked to state
whether they "strongly agree," "agree," are "neutral," "disagree"
or "strongly disagree" with the following statement: "Gas
rationing is one good way to deal with the energy shortage."
The results of this (hypothetical) poll are presented below.
Which measure of location is appropriate for these data and why?

RESPONSE              NUMBER

Strongly Agree           54
Agree                    97
Neutral                 150
Disagree                103
Strongly Disagree        46

Exercise 18. The grades of thirty high school students on a
French examination are recorded below. Which measure of location
is appropriate for these data and why?

80   84   79   81   75   68   76   72   90   96
85   86   88   85   70   92   87   90   80   80
72   73   84   91   64   76   71   76   81   68

Exercise 19. What measure of location would be appropriate for
the data given in Exercise 1 on page 5, and why?

Exercise 20. What measure of location would be appropriate for
the data given in Exercise 2 on page 6, and why?

5. PERCENTILES, DECILES AND QUARTILES

The measures discussed in this section are
measures of location or position, but are not properly
described as measures of central tendency. These are
the percentile scores, decile scores and quartile
scores. Percentiles will be described in detail.
Deciles and quartiles may be thought of as special cases
of percentiles.

5.1 Percentiles

Percentiles are defined and computed in a manner
analogous to the median. As with the median, care must
be taken to ensure that percentiles exist and are
unique. To begin with an example, the eightieth
percentile, denoted by $P_{80}$, may be thought of as a number
which is larger than 80% of the data and smaller than
20% of the data. Similarly, the thirty-fifth
percentile, $P_{35}$, may be thought of as a number which is
larger than 35% of the data and smaller than 65% of the
data. The formal definition is given below.

If r is any number from 1 to 99 then the rth
percentile for a set of data is a number, $P_r$, such that
at most r% of the data are less than $P_r$ and at most
(100 - r)% of the data are greater than $P_r$.

5.2 Computing Percentiles 

The method for finding a percentile score is very
similar to that for finding the median. In fact you
may have already noticed that the fiftieth percentile
and the median are identical. To find the rth
percentile:

a. Rank the data.

b. Find the number in the $\frac{r}{100}(n+1)$ position in
   the ranked data.

Suppose for example that we wish to find the 84th
percentile score for the height data given on page 1.
The data have been ranked in the frequency distribution
on page 2.

The position of $P_{84}$ is

$$\frac{84}{100}(180 + 1) = 152.04.$$

Thus $P_{84}$ is between the 152nd and 153rd numbers in the
ranked data. To avoid ambiguity we will take $P_{84}$ to be
four one-hundredths of the way between these two numbers.
That is,

$P_{84}$ = 152nd number + 0.04 (153rd number - 152nd number).

Counting through the frequency distribution from the
smallest class we find that the 152nd number is 167 and
the 153rd number is 168. Thus

$P_{84}$ = 167 + 0.04 (168 - 167) = 167 + 0.04 = 167.04.
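The position rule, including the step of moving a fraction of the way between two ranked numbers, can be sketched as follows. This Python function is an illustration only: its name is hypothetical, and the handling of positions that fall before the first or beyond the last ranked number is an assumption not covered by the text. The sample data are the fifteen numbers from problem 1 of the Model Exam.

```python
def percentile(values, r):
    """r-th percentile by the (r/100)(n + 1) position rule."""
    ranked = sorted(values)
    n = len(ranked)
    position = r / 100 * (n + 1)       # 1-based position in the ranked data
    whole = int(position)              # rank of the number just below the position
    fraction = position - whole        # how far to move toward the next number
    if whole == 0:
        return ranked[0]               # assumption: clamp to the smallest value
    if whole >= n:
        return ranked[-1]              # assumption: clamp to the largest value
    return ranked[whole - 1] + fraction * (ranked[whole] - ranked[whole - 1])

data = [8.1, 9.0, 7.5, 6.9, 9.0, 11.3, 10.9, 8.4,
        8.3, 9.6, 7.9, 12.5, 11.0, 10.6, 10.5]

p50 = percentile(data, 50)   # position 8: coincides with the median
p84 = percentile(data, 84)   # position 13.44: between the 13th and 14th numbers
print(p50, round(p84, 3))
```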
Exercise 21. Find $P_{24}$ and $P_{75}$ for the height data on page 2.



If the data are given in a frequency distribution
with class intervals then the method for finding $P_r$ is
similar to the method for finding the median given on
page 12. The position of $P_r$ is, as before,

$$\frac{r}{100}(n+1).$$

First we find the class containing this number, and
then we define $P_r$ by

$$P_r = L + \left[\frac{\frac{r}{100}(n+1) - S}{f}\right] w$$

where: L = lower limit of the class containing $P_r$

       S = sum of frequencies for classes lower than
           the class containing $P_r$

       f = frequency of the class containing $P_r$

       w = width of the class containing $P_r$
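The grouped-data formula for $P_r$ can be sketched in code as well. The Python function below is illustrative; its name, the triple representation of classes, and the fallback for a position beyond the last class are assumptions. It is applied to the grouped height data, so the result for r = 84 can be compared with the value 167.04 found from the ungrouped distribution.

```python
def grouped_percentile(classes, r):
    """Approximate the r-th percentile from (lower, upper, frequency) triples."""
    n = sum(f for _, _, f in classes)
    position = r / 100 * (n + 1)
    below = 0                                    # S: frequencies of lower classes
    for lower, upper, f in classes:
        if f > 0 and below + f >= position:      # class containing P_r
            return lower + (position - below) / f * (upper - lower)
        below += f
    return classes[-1][1]    # assumption: fall back to the top boundary

heights = [
    (144.5, 150.5, 15),
    (150.5, 156.5, 28),
    (156.5, 162.5, 73),
    (162.5, 168.5, 44),
    (168.5, 174.5, 15),
    (174.5, 180.5, 5),
]

p84 = grouped_percentile(heights, 84)
p50 = grouped_percentile(heights, 50)
print(round(p84, 2), round(p50, 1))
```

With r = 50 the function reduces to the grouped median formula of Section 3.2.2.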



Exercise 22. Compute $P_{30}$ and $P_{89}$ for the height data in the
frequency distribution on page 4.

5.3 Deciles and Quartiles 

The median divides the data into halves. The
percentiles divide the data into hundredths. Similarly,
the deciles divide the data into tenths and the
quartiles divide the data into quarters. The sixth decile,
denoted $D_6$, is that number such that six-tenths of the
data are less than $D_6$. The third quartile, $Q_3$, is that
number such that three-quarters of the data are less
than $Q_3$. Etc.

It is not necessary to present methods for finding
quartile and decile scores as these may be found by
computing the corresponding percentile scores.

$D_1 = P_{10}$
$D_2 = P_{20}$
$Q_1 = P_{25}$
$D_3 = P_{30}$
$D_4 = P_{40}$
$D_5 = Q_2 = P_{50}$
$D_6 = P_{60}$
$D_7 = P_{70}$
$Q_3 = P_{75}$
$D_8 = P_{80}$
$D_9 = P_{90}$

6. MODEL EXAM

1. Compute the mean and the median of the data below:

    8.1     9.0     7.5     6.9     9.0
   11.3    10.9     8.4     8.3     9.6
    7.9    12.5    11.0    10.6    10.5

2. Construct a frequency distribution for the following set of
   data using 130.5 as the lower boundary of the first class and
   having all classes of width 15.

   189   233   180   181   200   216   215   190
   141   165   193   201   177   217   175   168
   138   149   199   223   143   148   203   185
   183   192   163   168   166   177   140   193
   230   181   173   201   136   158   174   195

3. Compute the mean and the median of the data in problem number
   two from the frequency distribution.

4. Compute $Q_3$, $D_4$ and     from the raw data in problem number
   two.

5. What are "positively skewed" data?

6. When is the mode an important measure which should be reported?

7. ANSWERS TO EXERCISES



1. a.   pH       f

       7.95      1
       8.00      4
       8.05      8
       8.10      7
       8.15      9
       8.20      5
       8.25      4
       8.30      2

   b.  Class
       boundaries       f

       7.895-7.995      1
       7.995-8.095     12
       8.095-8.195     16
       8.195-8.295      9
       8.295-8.395      2



2. The frequency distribution you obtain depends upon your choice
   of classes. One possible result is shown below.

      TIME
    (in sec.)        f

    9.95-14.95      11
   14.95-19.95      10
   19.95-24.95       3
   24.95-29.95       1
   29.95-39.95       3
   39.95-49.95       1
   49.95-59.95       1


3. (Histogram of the grouped frequency distribution from
   Exercise 1, part b.)

4. Your result here depends on your choice of class intervals
   back in Exercise 2. If you, as I did, chose intervals of
   varying widths, remember that in a histogram it is the area
   of the rectangle, and not its height, which is proportional to
   the frequency. Note in addition that a vertical axis for
   frequency is not possible when the classes are of varying
   widths. The numbers inside parentheses on this histogram
   indicate the frequencies of the classes.

   (Histogram of the maze times with unequal class widths; the
   class frequencies 11, 10, 3, 1, 3, 1 and 1 are printed inside
   the rectangles.)

5. n = 40, $\sum x$ = 325.00, $\bar{x} = \frac{325.00}{40}$ = 8.125.



6.    x        f       x · f

     7.95      1        7.95
     8.00      4       32.00
     8.05      8       64.40
     8.10      7       56.70
     8.15      9       73.35
     8.20      5       41.00
     8.25      4       33.00
     8.30      2       16.60

   $\sum f$ = 40      $\sum (x \cdot f)$ = 325.00

   $$\bar{x} = \frac{325.00}{40} = 8.125$$

7.    CLASS           f        x        x · f

   7.895-7.995        1      7.945      7.945
   7.995-8.095       12      8.045     96.540
   8.095-8.195       16      8.145    130.320
   8.195-8.295        9      8.245     74.205
   8.295-8.395        2      8.345     16.690

   $\sum f$ = 40      $\sum (x \cdot f)$ = 325.700

   $$\bar{x} \approx \frac{325.7}{40} = 8.1425$$



8. The mean obtained in Exercise 6 agrees exactly with the mean
   obtained in Exercise 5, as it should. The mean of these data
   is 8.125. The mean obtained in Exercise 7 is only an
   approximation to the true mean. This loss of exactness is caused by
   the loss of information which occurs when the data are grouped
   into class intervals. Notice that the error of approximation
   is not large.

9. n = 30, $\sum x$ = 618.2, $\bar{x}$ = 20.61.

10.   CLASS           f        x        x · f

    9.95-14.95       11      12.45     136.95
   14.95-19.95       10      17.45     174.50
   19.95-24.95        3      22.45      67.35
   24.95-29.95        1      27.45      27.45
   29.95-39.95        3      34.95     104.85
   39.95-49.95        1      44.95      44.95
   49.95-59.95        1      54.95      54.95

   $\sum f$ = 30      $\sum (x \cdot f)$ = 611.00

   $$\bar{x} \approx \frac{611.00}{30} \approx 20.37.$$

   The answer to this exercise depends upon your choice of
   class intervals in Exercise 2.

11. n = $\sum f$ = 56. Position of $\tilde{x}$ = $\frac{1}{2}(n+1)$ = $\frac{1}{2}(57)$ = 28.5.
    The 28th and 29th numbers are both 6. Hence $\tilde{x}$ = 6.

12. The data have already been ranked in Exercise 1, part a.
    n = 40. The position of $\tilde{x}$ = $\frac{1}{2}(40+1)$ = 20.5. The 20th
    number is 8.10 and the 21st is 8.15. Thus $\tilde{x}$ =
    (8.10 + 8.15)/2 = 8.125.

    We note that the mean and the median are equal.
    Although exact equality is something of a coincidence, the
    mean and the median of a data set will be approximately
    equal whenever the histogram of the data is symmetric. This
    point will be discussed further in Section 4.

13. The position of the median is 20.5, as in Exercise 12. The
    median is in the third class. L = 8.095, S = 1 + 12 = 13,
    f = 16, w = 0.10.

    $$\tilde{x} = L + \left[\frac{\frac{1}{2}(n+1) - S}{f}\right] w = 8.095 + \left[\frac{20.5 - 13}{16}\right] 0.10 = 8.095 + 0.047 = 8.142.$$

    The approximate value of the median obtained here is
    reasonably close to the true value obtained in Exercise 12.

14. First we rank the data: 10.7, 10.8, 11.6, 11.8, 13.1, 13.3,
    13.9, 14.0, 14.1, 14.4, 14.8, 15.5, 15.7, 15.9, 16.0, 16.1,
    16.9, 17.5, 17.7, 18.3, 19.8, 20.3, 21.3, 23.2, 29.8, 34.6,
    38.3, 39.7, 42.9, 56.2. The position of $\tilde{x}$ = $\frac{1}{2}(30+1)$ = 15.5.
    The 15th number is 16.0 and the 16th number is 16.1. Thus
    $\tilde{x}$ = (16.0 + 16.1)/2 = 16.05. The mean for these data was
    20.61, which is markedly larger than the median.

15. There are two reasons to choose the median as the measure of
    location for these data. One is that the data are positively
    skewed, as is usually the case with income data. The other
    is that the last class is open-ended, which prevents the
    calculation of the mean unless we are willing to guess at
    an average value (midpoint) for this class.

    The data also seem to be bimodal, but not to a
    remarkable degree.

16. In this example the categories are not numerical. In fact
    they are not even ordered. Thus neither the mean nor the
    median can be used. This leaves the mode. Fortunately
    there is a pronounced mode: Gone With the Wind received the
    vote of almost half of the people polled.

17. As in Exercise 16, the categories are not numerical. Thus
    the mean is not a candidate for the measure of location.
    The categories are, however, ordered. With such ordinal
    data the median may be used. The position of the median is
    $\frac{1}{2}(450+1)$ = 225.5. The median response is "neutral." This is
    also the modal response. It seems that this accurately
    reflects the fact that, according to these responses, opinion
    on this question is rather evenly divided.



18. A frequency distribution and histogram for this data set are
shown below.

CLASS
60.5-66.5
66.5-72.5
72.5-78.5
78.5-84.5
84.5-90.5
90.5-96.5

[Histogram of the frequency distribution.]

The histogram above indicates that there is nothing about
this data set to indicate we should use a measure other than
the mean. Thus we choose the mean.

19. As in Exercise 18, we choose the mean because there seems to
be no strong reason to do otherwise.

20. Choose the median because the data are positively skewed.

21. Position of $P_{24} = \frac{24}{100}(180 + 1) = 43.44$. The 43rd number is
156 and the 44th number is 157. Therefore
$P_{24} = 156 + 0.44(157 - 156) = 156.44$.

Position of $P_{75} = \frac{75}{100}(180 + 1) = 135.75$. The 135th and
136th numbers are both 165. Thus $P_{75} = 165$.

Position of $P_{30} = \frac{30}{100}(180 + 1) = 54.3$. Thus $P_{30}$ is in the third
class. L = 156.5, S = 15 + 28 = 43, f = 73 and w = 6.

$$P_{30} = 156.5 + \left(\frac{54.3 - 43}{73}\right) 6 = 156.5 + 0.9 = 157.4.$$

Position of $P_{89} = \frac{89}{100}(180 + 1) = 161.09$. Thus $P_{89}$ is in
the fifth class. L = 168.5, S = 160, f = 15 and w = 6.

$$P_{89} = 168.5 + \left(\frac{161.09 - 160}{15}\right) 6 = 168.5 + 0.4 = 168.9.$$

8. ANSWERS TO MODEL EXAM



a) n = 15, $\sum x = 141.5$, $\bar{x} = \frac{141.5}{15} = 9.4$.

b) Ranked data: 6.9, 7.5, 7.9, 8.1, 8.3, 8.4, 9.0, 9.6, 10.5,
10.6, 10.9, 11.0, ...

Position of median = $\frac{1}{2}(15+1) = 8$. Median = 9.0.


CLASS           f
130.5-145.5     5
145.5-160.5     3
160.5-175.5     8
175.5-190.5     9
190.5-205.5     9
205.5-220.5     3
220.5-235.5     3


3.

CLASS           f      x      x · f
130.5-145.5     5     138      690
145.5-160.5     3     153      459
160.5-175.5     8     168     1344
175.5-190.5     9     183     1647
190.5-205.5     9     198     1782
205.5-220.5     3     213      639
220.5-235.5     3     228      684

            n = 40        Σ(x · f) = 7245

a) $\bar{x} = \frac{7245}{40} = 181.125 \approx 181$.

b) Position of median = $\frac{1}{2}(40+1) = 20.5$.

$$\text{median} = 175.5 + \left(\frac{20.5 - 16}{9}\right) 15 = 183.$$




4. Ranked data:

136   163   177   190   201
138   165   177   192   203
140   166   180   193   215
141   168   181   193   216
143   168   181   195   217
148   173   183   199   223
149   174   185   200   230
158   175   189   201   233



a) Position of $Q_3 = \frac{3}{4}(40+1) = 30.75$.
$Q_3 = 199 + 0.75(200 - 199) = 199.75$.

b) Position of $D_4 = \frac{4}{10}(40+1) = 16.4$.
$D_4 = 175 + 0.4(177 - 175) = 175.8$.

c) Position of $P_{21} = \frac{21}{100}(40+1) = 8.61$.
$P_{21} = 158 + 0.61(163 - 158) = 161.05$.
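
The position rule used in parts (a) to (c) can be sketched as a short Python function. This is an illustrative sketch only: the name `percentile_by_position` and the handling of positions that fall before the first or after the last observation are assumptions made here, not something the module specifies. The 18th ranked value is hard to read in the source and is taken as 177; none of the three results depends on it.

```python
import math


def percentile_by_position(ranked, p):
    """p-th percentile using position p(n + 1)/100 with linear interpolation."""
    n = len(ranked)
    position = p * (n + 1) / 100
    lower = math.floor(position)      # rank of the observation just below
    fraction = position - lower
    if lower < 1:                     # assumed: clamp to the smallest value
        return ranked[0]
    if lower >= n:                    # assumed: clamp to the largest value
        return ranked[-1]
    below, above = ranked[lower - 1], ranked[lower]
    return below + fraction * (above - below)


ranked = [136, 138, 140, 141, 143, 148, 149, 158,
          163, 165, 166, 168, 168, 173, 174, 175,
          177, 177, 180, 181, 181, 183, 185, 189,   # 18th value assumed
          190, 192, 193, 193, 195, 199, 200, 201,
          201, 203, 215, 216, 217, 223, 230, 233]

for label, p in (("Q3", 75), ("D4", 40), ("P21", 21)):
    print(label, round(percentile_by_position(ranked, p), 2))
```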

See pages 17-18.
See page 16.

1. INTRODUCTION

1.1 Approximation in Statistics

Approximation plays a central role in the application
and interpretation of statistical methods. For instance,
parametric probability representations of populations --
fundamental tools of statistical analysis -- are usually
only approximations of the actual natures of the
populations. Sampling distributions in use for these
probabilistic models are often themselves approximations to
those which are derived mathematically.

There are two principal areas in which approximations
are vital in formulating statistical problems:
in forming a convenient model of a population when the
actual structure of the population is either very complex
or unknown; and in developing easy, reasonably accurate
methods of computing probabilities when exact methods are
cumbersome.

We shall consider experiments consisting of n
"trials", where each trial results in one of two possible
outcomes (arbitrarily labeled "success" and "failure").
We shall look at two probability models for "the number
of successes in the n trials" and study ways to calculate,
exactly and approximately, the probability of k successes.
While these experiments are of a very special nature,
the use of approximations, both structural and mathematical,
in this context serves to illustrate the more general
application of approximations.

Before turning to approximation of probabilities,
however, we shall look at some examples of typical
numerical approximations and at a complementary way of
making computation more manageable.



1.2 Some Examples of Numerical Approximation

Suppose that, for some reason, we wanted to know
about how large $.7^{10}$ is, but we did not have the time or
patience (or the computer) to do all the multiplications.
Recalling the algebraic rules for exponents, we can write

$$.7^{10} = (.7^2)^5 = (.49)^5 .$$

Now, .49 is approximately 1/2. We abbreviate that
".49 ≈ 1/2" (the symbol ≈ means "is approximately
equal to"). So

$$.7^{10} \approx \left(\frac{1}{2}\right)^5 = \frac{1}{32} \approx .03 .$$

Actually, $.7^{10} = .0282$, to four decimal places, so the
approximation is nearly correct. Whether the approximation
is close enough depends on the purpose of the calculation.
For some applications, especially those which involve
further computation using the results of the approximation,
a simple approximation may not be close enough to the
value being approximated to be dependable.

Numerical approximation may take more complex forms.
A frequently encountered mathematical problem is finding
the area under a curve, like the shaded area in Figure
1a. We can approximate the area and perhaps simplify the
computation by using a series of rectangles whose total
area nearly coincides with the area under the curve (see
Figure 1b). The height of each rectangle at its center
is the height of the curve there. Some corners of the
rectangles are above the curve (overestimating the area)
and some are below the curve (underestimating the area).
If the rectangles are narrow enough, the approximation of
the area will be quite accurate. (Students of calculus
will recognize that the exact area is given by the
definite integral of the function defining the curve.)
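
The rectangle method just described can be tried out with a short Python sketch. The function name `rectangle_area` and its `rule` argument are assumptions introduced for illustration; the curve $f(x) = \sqrt{x}$ on the interval from 0 to 1 is the one used in the exercises of Section 1.3, where the exact area is 2/3.

```python
import math


def rectangle_area(f, a, b, n, rule="midpoint"):
    """Approximate the area under f between a and b with n equal rectangles.

    rule="midpoint": each height is taken at the center of the rectangle.
    rule="right":    each height is taken at the right-hand side.
    """
    width = (b - a) / n
    total = 0.0
    for i in range(n):
        left = a + i * width
        x = left + width / 2 if rule == "midpoint" else left + width
        total += f(x) * width
    return total


for n in (2, 8, 32):
    mid = rectangle_area(math.sqrt, 0.0, 1.0, n, "midpoint")
    right = rectangle_area(math.sqrt, 0.0, 1.0, n, "right")
    print(f"n = {n:2d}   midpoint = {mid:.4f}   right-hand = {right:.4f}")
```

Both rules move toward 2/3 as the rectangles become narrower, and the midpoint heights are closer for a given number of rectangles.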

Some of our probabilistic approximations will use
the reverse of this process: we shall use the area
under a continuous curve (which happens to be conveniently
tabulated) to approximate the area under a series of
narrow rectangles.

There are some general strategies for designing
approximations; they are part of the theory of numerical
approximation, which is an important branch of applied
mathematics but beyond the scope of this module.



1.3 Exercises 

Exercise 1. Approximate the area under the curve defined by
$f(x) = \sqrt{x}$ between x = 0 and x = 1. Try the following methods and
compare the approximate areas you compute with the exact area, 2/3.

a) Approximate the area from below, using a straight line.

[Figure: graph of f(x) with a straight line lying below the curve.]

b) Approximate the area from above, using a straight line with
the same slope as the line in part (a).

[Figure: graph of f(x) with a straight line lying above the curve.]

(If you know calculus, you can determine exactly the point at which
the line must be tangent to the curve and thus the algebraic
representation of the line. If not, you can use graph paper and a ruler
to plot f(x), draw the tangent line that has the proper slope, and
estimate its height at x = 0 and x = 1.)

c) Approximate the area using two rectangles, with heights
determined by the height of the curve on the right-hand sides of the
rectangles.

[Figure: graph of f(x) with two rectangles whose heights match the
curve at their right-hand sides.]

d) Approximate the area using two rectangles, with heights
determined by the height of the curve at the midpoints of the rectangles.

[Figure: graph of f(x) with two rectangles whose heights match the
curve at their midpoints; horizontal axis marked 0, .25, .5, .75, 1.]

e) Approximate the area using eight rectangles constructed like
those in part (c).

f) Approximate the area using eight rectangles constructed like
those in part (d).

g) Compare the difference between your answers to (c) and (d)
with the difference between your answers to (e) and (f).



1.4 Recursive Formulas

Computing numerical values for a mathematical
expression is often easier when the expression is represented
as a recursive formula. Simply stated, recursive formulas
are "building blocks" which permit the definition (or
computation) of the value of a function at some point
from the function's value at another point. Usually,
some starting value is determined or given, and the
function is constructed from this starting value.

For example, consider the function

$$f(k) = k^2$$

for the integers k = 0, 1, 2, .... A recursive
representation of the same function could be given by
specifying the function's value for 0,

$$f(0) = 0$$

-- which is the starting value -- and the recursive formula

$$f(k+1) = f(k) + 2k + 1.$$

Table I illustrates the process.



TABLE I

ILLUSTRATION OF RECURSIVE FORMULA f(k+1) = f(k) + 2k + 1
(EQUIVALENT TO NON-RECURSIVE FORMULA f(k) = k^2)

k      f(k)      2k + 1
0        0          1       (starting value)
1        1          3
2        4          5
3        9          7
4       16          9
5       25         11

Recursive formulas need not be additive, as our
example was. They may involve any kind of mathematical
computation. The recursive formulas used in our
probability calculations will call for f(k+1) to be determined
by multiplying f(k) by several quantities. Multiplicative
recursive formulas in particular tend to provide significant
reduction in the complexity of computations.

Recursive formulas can also be helpful in suggesting
approximations which would hold for large values of one
or more of the variables in the expression. Exercise 13
illustrates this use.



1.5 Exercises

Exercise 2. Let $f(0) = 1$ and $f(k+1) = \frac{5-k}{k+1} f(k)$ for k = 1, 2, 3, 4,
and 5.

a) Show that $f(k) = \binom{5}{k}$ by computing f(k) recursively,
computing $\binom{5}{k}$ directly, and comparing the results.

b) Show algebraically that $f(k) = \binom{5}{k}$. Hint: Prove that

$$\frac{\binom{5}{k+1}}{\binom{5}{k}} = \frac{5-k}{k+1} .$$



2. STRUCTURAL APPROXIMATION

2.1 Approximation of Hypergeometric Probabilities by
Binomial Probabilities

Suppose that the trials consist of sampling without
replacement n items at random from a finite population of
N items, K of which are successes. (Sampling without
replacement means that an item once chosen for inclusion in
the sample cannot be chosen again.) Then the exact
probability model for the number of successes is the
hypergeometric probability distribution; the probability that
k successes are selected is

(1)    $$h(k;N,n,K) = \frac{\binom{K}{k}\binom{N-K}{n-k}}{\binom{N}{n}} .$$

(We are considering here only values of k that are no greater
than K and also no greater than n.)

For example, if there are three pink grapefruits and
four yellow grapefruits in a bag and three grapefruits
are drawn at random, then the probability that exactly
one grapefruit in the sample is yellow (a success) and
the other two are pink (failures) is given by

$$\frac{\binom{4}{1}\binom{3}{2}}{\binom{7}{3}} = \frac{12}{35} = .343 .$$

For this example, N = 7 (the total number of grapefruits
in the bag), K = 4 (the number of yellow grapefruits in
the bag), n = 3 (the number of grapefruits in the sample),
and k = 1 (the number of yellow grapefruits that must
appear in the sample to realize the event we described).

The mathematical derivation of h(k;N,n,K) is based on
counting the total number of possible collections of n
items from a population of N items -- which is the
denominator, $\binom{N}{n}$ -- and the number of those collections
which contain exactly k successes (and n-k failures).
The latter number is the numerator, $\binom{K}{k}\binom{N-K}{n-k}$: there are
$\binom{K}{k}$ ways of collecting k successes from among the K
successes in the population, and for each of those ways
there are $\binom{N-K}{n-k}$ ways of putting together the n-k failures
from the N-K failures in the population.

In principle, we could evaluate the hypergeometric
probabilities for values of N, K, n and k which should
arise. However, for even moderately large values of
these four parameters, computation of the binomial
coefficients is time-consuming and tedious, and it is
useful to have an approximation which involves less tedious
calculation.

One of the most convenient methods for simplifying
the evaluation of hypergeometric probabilities involves
approximating with the binomial probability distribution.
This distribution represents the probability of a given
number of successes when the results of the trials are
statistically independent. If one is sampling with
replacement, the probability p of success on any given trial
is not affected by the outcomes of previous trials. (In
sampling with replacement, an item is "returned" to the
population after having been chosen for the sample, so
the item could be chosen again.) The trials are
independent, and the binomial distribution is applicable. In the
hypergeometric situation, if the population size N is
small or if the number of trials is an appreciable fraction
of N, then the probabilities governing the later trials
will be noticeably dependent on the outcomes of the earlier
trials. Even when N is large and a very small portion of
the population is drawn, the exact probability that k
successes will be chosen must be calculated from the
hypergeometric probability function, but the effect of
dependence is slight when N and K are large. If p is taken to
be the proportion of successes in the population (i.e.,
p = K/N), the approximation of the hypergeometric
probabilities by binomial probabilities

(2)    $$h(k;N,n,K) \approx b(k;n,p) = \binom{n}{k} p^k (1-p)^{n-k}$$

is quite accurate for N > 20n, or so.

Although N is not large enough for the approximation
to be valid, we can demonstrate its application to our
previous example. We would approximate h(1;7,3,4) = .343
by

$$b(1;3,4/7) = \binom{3}{1} (4/7)^1 (3/7)^2 = .315 .$$

TABLE II

ILLUSTRATION OF THE BINOMIAL APPROXIMATION TO HYPERGEOMETRIC
PROBABILITIES

N = 7, n = 3, K = 4
(N and K not large enough for approximation to be very
accurate)

Number of      Hypergeometric     Binomial
Successes      Probability        Approximation
k              h(k;7,3,4)         b(k;3,4/7)

0                .029               .079
1                .343               .315
2                .514               .420
3                .114               .186

Total           1.000              1.000



Table II shows the exact and approximate probabilities
for each of the possible numbers of successes in this
example.

This simplification of the calculation of hypergeometric
probabilities is based on consideration of the
structures of the sampling problems in the two situations.
When the population is large -- say, 20 or more times the
size of the sample -- sampling with replacement, as in the
binomial situation, differs little from sampling without
replacement, as in the hypergeometric situation. You are
unlikely to select randomly the same item twice from a very
large population, even if you are replacing items after
sampling them. We can think of such an approximation as a
structural approximation; the structures of the two problems
are similar, so the probability distributions are similar.



2.2 Exercises

In performing the following exercises, try to visualize why each
of the approximations should be as accurate (or inaccurate) as it is.
Use a computer or a calculator to do the calculations. Tabulating
the hypergeometric and binomial probabilities is easier when you use
the recursive formulas

(3)    $$h(k+1;N,n,K) = \frac{(K-k)(n-k)}{(k+1)(N-K-n+k+1)} \, h(k;N,n,K)$$

and

(4)    $$b(k+1;n,p) = \frac{(n-k)\,p}{(k+1)(1-p)} \, b(k;n,p)$$

after calculating h(0;N,n,K) and b(0;n,p) directly.
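
A minimal Python sketch of this tabulation follows. The function names `hypergeometric_table` and `binomial_table` are assumptions introduced here. The sketch also assumes that n is no larger than N - K, so that the starting value h(0;N,n,K) is positive, and that p is strictly between 0 and 1.

```python
from math import comb


def hypergeometric_table(N, n, K):
    """h(k;N,n,K) for k = 0, 1, ..., min(n, K), built with recursion (3).

    Assumes n <= N - K so that the starting value h(0) is not zero.
    """
    h = comb(N - K, n) / comb(N, n)           # starting value, computed directly
    table = [h]
    for k in range(min(n, K)):
        h *= (K - k) * (n - k) / ((k + 1) * (N - K - n + k + 1))
        table.append(h)
    return table


def binomial_table(n, p):
    """b(k;n,p) for k = 0, 1, ..., n, built with recursion (4); needs 0 < p < 1."""
    b = (1 - p) ** n                          # starting value, computed directly
    table = [b]
    for k in range(n):
        b *= (n - k) * p / ((k + 1) * (1 - p))
        table.append(b)
    return table


# The grapefruit example: N = 7, n = 3, K = 4, so p = K/N = 4/7.
exact = hypergeometric_table(7, 3, 4)
approx = binomial_table(3, 4 / 7)
for k, (h, b) in enumerate(zip(exact, approx)):
    print(f"k = {k}   h = {h:.3f}   b = {b:.3f}")
```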

Exercise 3. Tabulate the hypergeometric probability function and its
binomial approximation for:

a) N = 10, n = 5, K = 5
b) N = 10, n = 5, K = 1
c) N = 100, n = 5, K = 50
d) N = 100, n = 5, K = 10

Exercise 4. Repeat parts (c) and (d) of Exercise 3 for n = 20 instead
of n = 5. Has the quality of the approximation changed?

Exercise 5. Rose Maybud is choosing at random six members of the
United States House of Representatives and determining whether or
not each of them supports a particular bill. Explain why this
situation is hypergeometric, and identify N, K, n, and k. Which of their
values can you determine from our statement of Rose's activity?
Would the binomial approximation of the hypergeometric
probabilities be adequate? Why?

3. MATHEMATICAL APPROXIMATION

3.1 Approximation of Binomial Probabilities Using the
Normal Distribution

When the number of trials n is large, even binomial
probabilities are cumbersome to compute, and it helps to
have a simple method of approximating them. For large
values of n and values of p which are not too close to
zero or one, the cumulative binomial distribution function

(5)    $$B(k;n,p) = \sum_{i=0}^{k} b(i;n,p)$$

may be approximated by the cumulative normal distribution
function thus:

(6)    $$B(k;n,p) \approx \Phi\left(\frac{k - np}{\sqrt{np(1-p)}}\right) .$$

The function $\Phi(y)$ is the cumulative distribution function
of the standard normal distribution, which has mean zero
and variance one. To apply this approximation, you calculate
the quantity $y = (k - np)/\sqrt{np(1-p)}$ and refer to a table of
the standard normal cumulative distribution function to
determine approximately the probability of k or fewer
successes in the n trials.

For example, suppose that we are interested in finding
the probability of 20 or fewer successes in 56 independent
trials, where each trial has probability .45 of resulting
in a success. In order to compute this quantity exactly,
we would have to add up the binomial probabilities for 21
values of k (0, 1, 2, ..., 20). For each k, we would have
to compute the binomial coefficient $\binom{56}{k}$, raise .45 to the
power k, and raise .55 to the power 56-k (or at least
compute that quantity for k = 0, and then use the recursive
formula (4) repeatedly). We might find an answer in a
published table of binomial distributions, but such tables do
not cover all possible values of n and p. A computer
might be used to perform the calculations, but for values
of n much larger than 56, even computer calculation would
be rather time-consuming and subject to roundoff error.
Hence we find the normal approximation attractive.

To apply it, we compute

$$y = \frac{20 - 56(.45)}{\sqrt{56(.45)(.55)}} = -1.397$$

and refer to a table of the standard normal distribution
to find that

$$B(20;56,.45) \approx .081 .$$

By referring to a table of binomial distributions or by
computing, we can find the exact value of B(20;56,.45)
= .103. (For a better approximation see page 15.)
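
The comparison in this example can be reproduced with a short Python sketch. The function names are assumptions made for illustration, and the standard normal cumulative distribution function is written in terms of the error function `math.erf` instead of being read from a table.

```python
from math import comb, erf, sqrt


def binomial_cdf(k, n, p):
    """Exact B(k;n,p): probability of k or fewer successes in n trials."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def standard_normal_cdf(y):
    """Phi(y) for the standard normal distribution, via the error function."""
    return 0.5 * (1.0 + erf(y / sqrt(2.0)))


def normal_approx_cdf(k, n, p):
    """Approximation (6): Phi((k - np) / sqrt(np(1 - p)))."""
    return standard_normal_cdf((k - n * p) / sqrt(n * p * (1 - p)))


n, p, k = 56, 0.45, 20
print(f"exact value          = {binomial_cdf(k, n, p):.3f}")
print(f"normal approximation = {normal_approx_cdf(k, n, p):.3f}")
```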

Just as the cumulative binomial distribution may be
approximated by the cumulative normal distribution, so
may the individual binomial probabilities be approximated
by the density function of the normal distribution,

(7)    $$b(k;n,p) \approx \frac{1}{\sqrt{np(1-p)}} \, \phi\left(\frac{k - np}{\sqrt{np(1-p)}}\right) .$$

$\phi$ is the density function of the standard normal distribution,

(8)    $$\phi(y) = \frac{1}{\sqrt{2\pi}} \, e^{-y^2/2} .$$

We approach the derivation of the normal approximation
of binomial probabilities somewhat differently from the way
we discussed the previous approximation. In that
discussion, we noted the structural nature of the binomial
approximation of hypergeometric probabilities. The normal
approximation, however, is derived from a more intrinsically
mathematical formulation, and we consider the nature of the
approximation to be more mathematical. That is to say,
we chose to employ this particular approximation because
of a mathematical derivation, rather than an elementary
structural similarity between binomial sampling schemes
and those which commonly give rise to normally distributed
random variables. A normal random variable is, after all,
continuous, while a binomial or hypergeometric random
variable is discrete, and it would appear that they are not
structurally similar. A less immediately apparent
similarity between binomial and normal random variables is
revealed, though, by mathematical manipulation. But rather
than being a property of these two specific distributions,
it applies more generally to the normal distribution.
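Recall that a Central Limit Theorem states that if
$Y_1, Y_2, \ldots, Y_n$ are independent random variables, each with
mean $\mu$ and finite variance $\sigma^2$, and $\bar{Y}$ is their average,
then for large n

(9)    $$P(\bar{Y} \le y) \approx \Phi\left(\frac{y - \mu}{\sigma/\sqrt{n}}\right)$$

or equivalently,

(10)    $$P\left(\frac{\bar{Y} - \mu}{\sigma/\sqrt{n}} \le y\right) \approx \Phi(y)$$

for any y.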

To apply the Central Limit Theorem to the binomial
problem, we let $Y_i$ take on the value 1 if the i-th trial
results in a success or 0 if it results in a failure.
Then $\bar{Y}$ is the total number of successes divided by n. The
mean of each $Y_i$ is

(11)    $$\mu = \sum y \, p(y) = 0 \cdot (1-p) + 1 \cdot p = p ,$$

and the variance of each $Y_i$ is

(12)    $$\sigma^2 = \sum (y - \mu)^2 p(y) = (0-p)^2 (1-p) + (1-p)^2 p = p(1-p) .$$

The Central Limit Theorem states that $\bar{Y}$ is approximately
normally distributed, so $n\bar{Y}$, the total number of successes
in the n trials, is also approximately normally distributed.
You should verify that the theorem as stated here leads to
the normal approximation given above for binomial
probabilities.

The difference in the application of the two types of
approximation -- structural and mathematical -- is therefore
more conceptual than practical.

3.2 Accuracy of the Normal Approximation 

The normal approximation to the binomial distribution
is quite accurate for situations in which there are
both large values of n and values of p not too close to
zero or one. Most statisticians regard the approximation
as satisfactory whenever np(1-p) is greater than 5. When
this condition is violated, one of two alternative
approximations may be applicable.

3.3 The Continuity Correction to the Normal Approximation

The first alternative approximation is a refinement
of the normal approximation. It involves the use of a
"continuity correction". Instead of finding $\Phi(y)$ for
$y = (k - np)/\sqrt{np(1-p)}$, we evaluate $\Phi$ for a slightly
different y:

(13)    $$B(k;n,p) \approx \Phi\left(\frac{k - np + .5}{\sqrt{np(1-p)}}\right)$$

In effect, this modification assigns to k half the
probability between k and k+1 in the normal approximation. (See
Figure 2.) Although it generally improves the accuracy of
the normal approximation, this refinement is less
important for larger n, since the effect on y of the added 1/2
diminishes as n increases. (Compare Exercise 1, part (g).)
The continuity correction extends the validity of the
normal approximation to considerably smaller n.



To illustrate the application of the continuity
correction, we take another look at the example of Section
3.1. The value of y would now be

$$y = \frac{20 - 56(.45) + .5}{\sqrt{56(.45)(.55)}} = -1.262$$

and

$$B(20;56,.45) \approx \Phi(-1.262) = .103 .$$

Notice that this value is the same as the exact value to
three decimal places -- much closer than the approximation
(.081) which was obtained without using the continuity
correction.
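
The effect of the continuity correction in this example can be checked with a short Python sketch. The function names and the `continuity` flag are assumptions introduced for illustration, and the standard normal cumulative distribution function is computed from `math.erf` instead of a table.

```python
from math import erf, sqrt


def standard_normal_cdf(y):
    """Phi(y) for the standard normal distribution, via the error function."""
    return 0.5 * (1.0 + erf(y / sqrt(2.0)))


def normal_approx_cdf(k, n, p, continuity=True):
    """Normal approximation to B(k;n,p), with or without the added 1/2."""
    shift = 0.5 if continuity else 0.0
    y = (k - n * p + shift) / sqrt(n * p * (1 - p))
    return standard_normal_cdf(y)


n, p, k = 56, 0.45, 20
print(f"without correction: {normal_approx_cdf(k, n, p, continuity=False):.3f}")
print(f"with correction:    {normal_approx_cdf(k, n, p, continuity=True):.3f}")
```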

[Figure 2 plots, for k = 5 to 19, the exact binomial
probabilities b(k;20,.6), the normal approximation without
continuity correction, and the normal approximation with
continuity correction (curve shifted one-half unit to left).]

Figure 2. Normal approximations to binomial probabilities
for n = 20, p = .6. (Area between lines under
curve is probability assigned to k successes.)
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3.4 Approximation of binomial Probabilities by Poisson 
Probabilities

The second alternative approximation may be applicable
when values of p are very small (near zero) or large (near
one). We need consider only small values of p; if p is
large, we can interchange our definitions of "success"
and "failure" and apply the discussion below. (We can
make the exchange because "success" and "failure" are
arbitrary designations, and it will suffice because a very
large probability of "success" implies a very small
probability of "failure".)

When n is fairly large, p is small, and np is moderate
(perhaps somewhere between 0.5 and 5), the probability of
k successes in n trials may be approximated by the Poisson
probability distribution:

(14)    $$b(k;n,p) \approx p(k;np) = \frac{(np)^k e^{-np}}{k!}$$

The values of p(k;np) are easily computed with a calculator
or by a computer.

In illustrating the Poisson approximation, we shall
suppose that we want to obtain an approximation of the
probability of no successes or one success in one hundred
independent trials, each trial with probability of success
.02. To apply the Poisson approximation, we find np =
100(.02) = 2 and compute the approximations of the
probabilities of zero successes and one success, obtaining

$$B(1;100,.02) = b(0;100,.02) + b(1;100,.02) \approx p(0;2) + p(1;2) = \frac{2^0 e^{-2}}{0!} + \frac{2^1 e^{-2}}{1!} = .406$$

The exact probability, computed from the binomial
distribution, is .403; the uncorrected normal approximation is
.238, and the corrected normal approximation is .361. In
this example, the Poisson approximation is considerably
more accurate than either of the normal approximations.

The basis for the continuity correction is essentially
mathematical -- it exploits the particular way in
which binomial probabilities begin to resemble normal
probabilities as n becomes large. Although the Poisson
approximation may be derived mathematically, we can see
it as manifested more intuitively in structure. If we
imagine that we are holding constant the number of successes
likely to be observed but allowing the number of trials to
increase, then the experiment begins to resemble a process
in which successes occur "at random" across time. Such
a process gives rise directly to a Poisson distribution.
In this sense, the Poisson approximation is structural,
although its derivation is frequently represented
mathematically. The analogy between the Poisson approximation
and the Poisson process of stochastic-process theory is
discussed in most elementary probability texts.
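
The Poisson example of this section (one hundred trials, probability of success .02, at most one success) can be checked numerically with the following Python sketch. The function names `binomial_pmf` and `poisson_pmf` are assumptions introduced for illustration.

```python
from math import comb, exp, factorial


def binomial_pmf(k, n, p):
    """Exact b(k;n,p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, mean):
    """Poisson probability p(k;mean) = mean^k e^(-mean) / k!."""
    return mean**k * exp(-mean) / factorial(k)


n, p = 100, 0.02
exact = sum(binomial_pmf(k, n, p) for k in (0, 1))
approx = sum(poisson_pmf(k, n * p) for k in (0, 1))
print(f"exact binomial probability = {exact:.3f}")
print(f"Poisson approximation      = {approx:.3f}")
```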



3.5 Exercises 

To do the following exercises, use the recursive formula (4)
for computing binomial probabilities and the corresponding formula

(15)    $$p(k+1;np) = \frac{np}{k+1} \, p(k;np)$$

for computing Poisson probabilities.

Exercise 6. Tabulate the cumulative binomial distribution function
and its normal and Poisson approximations for n = 5, 20 and 50 for
each value of p = .5, .25, and .1. For which values of n and p
does each approximation appear to be valid? Which method of
approximation gives better results in the "tails" of the distribution
when p is small? Compare the results of using differences between
successive values of k in the normal approximation to the cumulative
binomial distribution with the results of using the direct
approximation of b(k;n,p) described by equation (7).

Exercise 7. Recompute the normal approximations of Exercise 6
using the continuity correction, and describe its effect on the
accuracy of the approximations.

Exercise 8. A jury panel of 100 members was selected from a
community in which 25% of the jury-eligible residents own no land. 90
of the panel members were land owners. How likely is it that
non-land-owners are that scarce on a panel when selection is truly random?

Exercise 9. Suppose that in the community of Cxercise" ?tof the 
jury-eligible residents have completed fewer than 8 years of school.
What is the probability that every member of a randomly selected jury
panel has completed 8 or more years of school?



4. CONCLUSION



4.1 Summary

The following diagram summarizes the approximations
we have discussed.

[Diagram: hypergeometric -> binomial; binomial -> normal;
binomial -> Poisson; Poisson -> normal.]

That is, the hypergeometric probabilities are approximated
by the binomial probabilities for large populations.
The binomial probabilities in turn have normal and Poisson
approximations; so, therefore, do the hypergeometric
probabilities. The diagram shows that Poisson probabilities
have a normal approximation for large values of the
parameter, but we have not discussed that approximation
here.

In all of the populations we discussed, the numerical
values are either zeros or ones, representing dichotomous
outcomes -- success or failure, yellow or pink, etc. There
are approximation techniques for other kinds of populations.
Many such techniques are in common use in statistics,
especially techniques based in some way on Central Limit
Theorems. Approximate statistical methods, based on
approximate probability calculations, are widely used by
statisticians. Discussion of the theoretical bases for
approximate statistical methods is beyond the scope of this
module; however, the techniques have the same two bases --
structural and mathematical approximations.

From these and similar approximations, you should be
gaining the feeling that it is possible for several
probability models whose similarity is not immediately apparent
to reflect a given sampling problem. As you progress in
your study of inferential statistical methods, it will
become more and more necessary for you to rely on the
ideas of approximation in choosing models for populations
and in deriving approximate sampling distributions for the
statistics you will be using in reaching conclusions about
the populations. The approximations here of hypergeometric
and binomial distributions are useful as presented, for
determining the probabilities of given numbers of
successes, but examining them should in addition give you some
familiarity with the advantages and limitations of
approximation in general.



4.2 Exercises

Exercise 10. How might one obtain a normal approximation to
hypergeometric probabilities? For what values of N, n, and K would it
be valid?

Exercise 11. A committee of 25 people is to be drawn at random from
a group consisting of 120 men and 80 women. Obtain an approximation
of the probability that more than half of the committee members will
be men.

Exercise 12. Wilfred Shadbolt is inspecting brackets. He tests
30 of them, choosing the 30 randomly (without replacement) from a
lot of 5000. If the 5000 include 150 defective brackets, what is the
probability that at least one defective bracket will be among the
30 tested?

Exercise 13. Show that

a) as N becomes very large (while K/N = p remains constant),
the coefficient of h(k;N,n,K) in formula (3) approaches the coefficient
of b(k;n,p) in formula (4).

b) as n becomes very large and p becomes very small (while
np remains constant), the coefficient of b(k;n,p) in formula (4)
approaches the coefficient of p(k;np) in formula (15).

(Rigorous demonstration of these propositions, each of which
corresponds to a segment of the diagram of Section 4.1, requires some
calculus.)
5. ANSWERS TO EXERCISES 



Exercise 1.(a) 
Area of triangle = 1/2 x 1 x 1 = 1/2. 

Exercise 1.(b) 
Slope of tangent line = 1. To find the tangent point, set 
$\frac{df}{dx} = \frac{1}{2\sqrt{x}} = 1$ and solve to obtain x = 1/4. 
The line intersects the vertical axis at f(1/4) - 1/4 = 1/4; the height 
of the line at x = 1 is f(1/4) + 3/4 = 5/4. Area of trapezoid is 
1 x (1/4 + 5/4)/2 = 3/4. 



Exercise 1.(c) 
f(1/2) = .7071; f(1) = 1. Area of first rectangle = .3536; area 
of second rectangle = .5. Approximate area = .8536. 

Exercise 1.(d) 
f(1/4) = .5; f(3/4) = .8660. Area of first rectangle = .25; area 
of second rectangle = .4330. Approximate area = .6830. 

Exercise 1.(e) 
    x        f(x)      Area of rectangle 
   .125     .3536          .0442 
   .250     .5000          .0625 
   .375     .6124          .0765 
   .500     .7071          .0884 
   .625     .7906          .0988 
   .750     .8660          .1083 
   .875     .9354          .1169 
  1.000    1.0000          .1250 

Approximate Area = .7206 

Exercise 1.(f) 
    x        f(x)      Area of rectangle 
   .0625    .2500          .0313 
   .1875    .4330          .0541 
   .3125    .5590          .0699 
   .4375    .6614          .0827 
   .5625    .7500          .0938 
   .6875    .8292          .1036 
   .8125    .9014          .1127 
   .9375    .9682          .1210 

Approximate Area = .6691, which is very close to 2/3. 






Exercise 1.(g) 

The answers to (c) and (d) are farther apart than the answers to 
(e) and (f). Taking the height of a rectangle to be f(x) at the 
center of the rectangle rather than at the edge is more critical 
to the success of the approximation when fewer, broader rectangles 
are used. 
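
The comparison in parts (c) through (g) can be reproduced with a few lines of code. The sketch below assumes f(x) = sqrt(x) on the interval from 0 to 1, which is consistent with the tabulated values (for instance f(1/4) = .5); the function name `rectangle_area` is illustrative only. Results may differ from the tables in the last digit, because the tables add up rounded rectangle areas.

```python
import math

def rectangle_area(f, n, position="right"):
    """Approximate the area under f on [0, 1] with n equal-width rectangles.

    position="right" takes each height at the rectangle's right edge,
    as in parts (c) and (e); position="mid" takes it at the center,
    as in parts (d) and (f).
    """
    width = 1.0 / n
    offset = 1.0 if position == "right" else 0.5
    return sum(f((i + offset) * width) * width for i in range(n))

for n in (2, 8):
    right = rectangle_area(math.sqrt, n, "right")
    mid = rectangle_area(math.sqrt, n, "mid")
    # The gap between the two choices shrinks as the rectangles narrow.
    print(f"n={n}: right edge {right:.4f}, center {mid:.4f}, gap {right - mid:.4f}")
```

With 2 rectangles the two choices of height disagree by about .17; with 8 rectangles the disagreement drops to about .05, which is the point made in part (g).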



Exercise 2.(a) 

 k    f(k) = 5!/(k! (5-k)!)                          (5-k)/(k+1) 

 0    (5x4x3x2x1)/(1 x 5x4x3x2x1)   =  1                 5 
 1    (5x4x3x2x1)/(1 x 4x3x2x1)     =  5                 2 
 2    (5x4x3x2x1)/(2x1 x 3x2x1)     = 10                 1 
 3    (5x4x3x2x1)/(3x2x1 x 2x1)     = 10                1/2 
 4    (5x4x3x2x1)/(4x3x2x1 x 1)     =  5                1/5 
 5    (5x4x3x2x1)/(5x4x3x2x1 x 1)   =  1                 0 



Exercise 2.(b) 

Suppose that $f(k) = \binom{5}{k}$. Then 
$f(0) = \binom{5}{0} = \frac{5!}{0!\,5!} = 1$, and 

$$\frac{f(k+1)}{f(k)} = \frac{\binom{5}{k+1}}{\binom{5}{k}} = \frac{5!/[(k+1)!\,(4-k)!]}{5!/[k!\,(5-k)!]} = \frac{k!\,(5-k)!}{(k+1)!\,(4-k)!} = \frac{k!\,(5-k)\,(4-k)!}{(k+1)\,k!\,(4-k)!} = \frac{5-k}{k+1}.$$

So $f(k+1) = \frac{5-k}{k+1}\,f(k)$, which is the recursive formula sought. 
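
The recursion just derived gives a convenient way to generate every binomial coefficient without computing a factorial. The following is a minimal sketch; the function name `binomial_coefficients` is an illustrative choice, and the general case with n in place of 5 is an assumption that follows the same algebra.

```python
def binomial_coefficients(n):
    """Return [C(n,0), ..., C(n,n)] from f(0) = 1 and f(k+1) = f(k) * (n-k)/(k+1)."""
    values = [1]
    for k in range(n):
        # Multiply before dividing: f(k) * (n - k) is always divisible by k + 1,
        # so integer division keeps the result exact.
        values.append(values[-1] * (n - k) // (k + 1))
    return values

print(binomial_coefficients(5))
```

For n = 5 this reproduces the f(k) column of Exercise 2.(a): 1, 5, 10, 10, 5, 1.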
Exercise 3.(a) 

          Exact           Binomial 
 k    Hypergeometric    Approximation 

 0       0.0040            0.0313 
 1       0.0992            0.1563 
 2       0.3968            0.3125 
 3       0.3968            0.3125 
 4       0.0992            0.1563 
 5       0.0040            0.0313 

Exercise 3.(b) 

          Exact           Binomial 
 k    Hypergeometric    Approximation 

 0       0.5000            0.5905 
 1       0.5000            0.3280 

Exercise 3.(c) 

          Exact           Binomial 
 k    Hypergeometric    Approximation 

 0       0.0281            0.0313 
 1       0.1529            0.1563 
 2       0.3189            0.3125 
 3       0.3189            0.3125 
 4       0.1529            0.1563 
 5       0.0281            0.0313 

Exercise 3.(d) 

          Exact           Binomial 
 k    Hypergeometric    Approximation 

 0       0.5838            0.5905 
 1       0.3394            0.3280 
 2       0.0702            0.0729 
 3       0.0064            0.0081 
 4       0.0003            0.0004 
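
Tables like these can be generated directly from the two probability formulas. The sketch below is illustrative: the function names are hypothetical, and the parameter values (N = 10, K = 5, n = 5 for part (a); N = 100, K = 10, n = 5 for part (d)) are assumptions inferred from the tabulated probabilities, since the exercise statement is not reproduced in this section.

```python
from math import comb

def hypergeometric_pmf(k, N, n, K):
    """h(k; N, n, K): probability of k successes in a sample of n drawn
    without replacement from N items, K of which are successes."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

def binomial_pmf(k, n, p):
    """b(k; n, p): probability of k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def compare(N, n, K):
    """Print exact hypergeometric values beside the binomial approximation."""
    p = K / N
    for k in range(min(n, K) + 1):
        exact = hypergeometric_pmf(k, N, n, K)
        approx = binomial_pmf(k, n, p)
        print(f"{k:2d}   {exact:.4f}   {approx:.4f}")

compare(N=10, n=5, K=5)    # sample is half the population: poor agreement
compare(N=100, n=5, K=10)  # sample is 5% of the population: close agreement
```

The second call shows the pattern behind the answers: the binomial approximation improves as the sample becomes a smaller fraction of the population.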



Exercise 4.(c) 

          Exact           Binomial 
 k    Hypergeometric    Approximation 

 3       0.0004            0.0011 
 4       0.0021            0.0046 
 5       0.0089            0.0148 
 6       0.0278            0.0370 
 7       0.0661            0.0739 
 8       0.1216            0.1201 
 9       0.1746            0.1602 
10       0.1969            0.1762 
11       0.1746            0.1602 
12       0.1216            0.1201 
13       0.0661            0.0739 
14       0.0278            0.0370 
15       0.0089            0.0148 
16       0.0021            0.0046 
17       0.0004            0.0011 



Exercise 4.(d) 

          Exact           Binomial 
 k    Hypergeometric    Approximation 

 0       0.0951            0.1216 
 1       0.2679            0.2702 
 2       0.3182            0.2852 
 3       0.2092            0.1901 
 4       0.0841            0.0898 
 5       0.0215            0.0319 
 6       0.0035            0.0089 
 7       0.0004            0.0020 

Exercise 5. 

Rose is sampling randomly without replacement from a finite 
population of dichotomous outcomes. N is the total number of members 
of the U.S. House of Representatives, which is 435. K is the 
number of Representatives supporting the bill in which Rose is 
interested. n is the number of Representatives in Rose's sample, which 
is 6. k will be the number of Representatives in Rose's sample 
who support the bill. The binomial approximation would be adequate, 
because 6 is a small fraction of 435. 
Exercise 8. 

The hypergeometric probabilities can be approximated by 

$$B(10;100,.25) \approx \Phi\!\left(\frac{10 - 100(.25) + .5}{\sqrt{100(.25)(.75)}}\right) = \Phi(-3.35) = .0004$$
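
The arithmetic in this answer can be checked numerically. The sketch below is illustrative only: the helper names are hypothetical, and the standard normal distribution function is computed from `math.erf` rather than read from a table.

```python
from math import erf, sqrt

def normal_cdf(z):
    """Standard normal distribution function, written Phi in the text."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def binomial_cdf_normal(k, n, p):
    """Approximate B(k; n, p) = P(X <= k) by the normal distribution,
    adding .5 to k as the continuity correction."""
    z = (k + 0.5 - n * p) / sqrt(n * p * (1 - p))
    return normal_cdf(z)

z = (10 + 0.5 - 100 * 0.25) / sqrt(100 * 0.25 * 0.75)
print(f"z = {z:.2f}")
print(f"approximate B(10;100,.25) = {binomial_cdf_normal(10, 100, 0.25):.4f}")
```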

Exercise 9. 

The hypergeometric probability can be approximated by 

$$b(0;100,.04) \approx p(0;4) = \frac{4^{0}\,e^{-4}}{0!} = .018$$

Exercise 10. 

Approximate the hypergeometric probabilities with binomial 
probabilities, and approximate the binomial probabilities with one of the 
normal approximations. N should be very large, K should be an 
appreciable fraction of N, and n should be large (but still a small 
fraction of N). 

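Exercise 11. 

$$
\begin{aligned}
P(\text{number of men} \ge 13) &= P(\text{number of women} \le 12) \\
&= H(12;200,25,80) \\
&\approx B(12;25,.4) \\
&\approx \Phi\!\left(\frac{12 - 25(.4) + .5}{\sqrt{25(.4)(.6)}}\right) \\
&= \Phi(1.02) \\
&= .846
\end{aligned}
$$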
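
As a check on Exercise 11, the exact hypergeometric probability, the binomial approximation, and the normal approximation can be computed side by side. This is a sketch under the assumption that H and B denote cumulative probabilities with arguments ordered as H(k; N, n, K) and B(k; n, p); the function names are illustrative.

```python
from math import comb, erf, sqrt

def hypergeometric_cdf(k, N, n, K):
    """H(k; N, n, K): P(at most k successes) when sampling without replacement."""
    return sum(comb(K, j) * comb(N - K, n - j) for j in range(k + 1)) / comb(N, n)

def binomial_cdf(k, n, p):
    """B(k; n, p): P(at most k successes) in n independent trials."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k + 1))

def normal_cdf(z):
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# "Successes" are women: 80 of the 200 people, so p = .4 and at most 12 women
# on the committee of 25 means at least 13 men.
exact = hypergeometric_cdf(12, 200, 25, 80)
binom = binomial_cdf(12, 25, 0.4)
normal = normal_cdf((12 + 0.5 - 25 * 0.4) / sqrt(25 * 0.4 * 0.6))

print(f"exact hypergeometric: {exact:.3f}")
print(f"binomial:             {binom:.3f}")
print(f"normal:               {normal:.3f}")
```

Since the committee is expected to contain 15 men, a probability well above one half is what one should expect for "more than half men".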

Exercise 12. 

$$
\begin{aligned}
P(\text{at least one bracket defective}) &= 1 - P(\text{no brackets defective}) \\
&= 1 - h(0;5000,30,150) \\
&\approx 1 - b(0;30,.03) \\
&\approx 1 - p(0;.9) \\
&= 1 - \frac{.9^{0}\,e^{-.9}}{0!} \\
&= 1 - .407 = .593
\end{aligned}
$$
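
Each step in this chain of approximations can be evaluated to see how much accuracy is given up along the way. The sketch below is illustrative; the function name `prob_at_least_one` is hypothetical.

```python
from math import comb, exp

def prob_at_least_one(N, n, K):
    """Return P(at least one success in the sample) three ways:
    exact hypergeometric, binomial approximation, Poisson approximation."""
    p = K / N      # proportion of successes in the population
    lam = n * p    # Poisson mean
    hyper = 1 - comb(N - K, n) / comb(N, n)
    binom = 1 - (1 - p) ** n
    poisson = 1 - exp(-lam)
    return hyper, binom, poisson

hyper, binom, poisson = prob_at_least_one(N=5000, n=30, K=150)
print(f"hypergeometric: {hyper:.3f}")
print(f"binomial:       {binom:.3f}")
print(f"Poisson:        {poisson:.3f}")
```

The three values agree to within about .01, which is why the much simpler Poisson calculation is acceptable here.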
Exercise 13.(a) 

$$\frac{(K-k)(n-k)}{(k+1)(N-K-n+k+1)} = \frac{\frac{1}{N}(K-k)(n-k)}{\frac{1}{N}(k+1)(N-K-n+k+1)} = \frac{\left(\frac{K}{N}-\frac{k}{N}\right)(n-k)}{(k+1)\left(1-\frac{K}{N}-\frac{n-k-1}{N}\right)}$$

As N becomes very large, $\frac{k}{N}$ and $\frac{n-k-1}{N}$ become so small as to be 
negligible, so the expression above is approximately 

$$\frac{\frac{K}{N}(n-k)}{(k+1)\left(1-\frac{K}{N}\right)}$$

Because $\frac{K}{N} = p$, we can write that as $\frac{(n-k)p}{(k+1)(1-p)}$, which is the 
coefficient of b(k;n,p) in formula (4). 

Exercise 13.(b) 

$$\frac{(n-k)p}{(k+1)(1-p)} = \frac{np - kp}{(k+1) - (k+1)p}$$

As p becomes very small (but np remains constant), kp and (k+1)p 
become so small as to be negligible, so the expression above 
approaches $\frac{np}{k+1}$, which is the coefficient of p(k;np) in formula 
(15). 
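
Both limits can be watched numerically. The sketch below assumes that formulas (3), (4), and (15), which are not reproduced in this section, are the usual one-step recursions h(k+1) = c h(k), b(k+1) = c b(k), and p(k+1) = c p(k), with the coefficients c shown in the answers above; the function names are illustrative.

```python
def hyper_coef(k, N, n, K):
    """Assumed coefficient of h(k; N, n, K) in formula (3)."""
    return (K - k) * (n - k) / ((k + 1) * (N - K - n + k + 1))

def binom_coef(k, n, p):
    """Assumed coefficient of b(k; n, p) in formula (4)."""
    return (n - k) * p / ((k + 1) * (1 - p))

def poisson_coef(k, lam):
    """Assumed coefficient of p(k; lam) in formula (15)."""
    return lam / (k + 1)

k, n, p = 2, 10, 0.3
print("part (a): N grows while K/N = .3")
for N in (100, 10_000, 1_000_000):
    K = int(p * N)
    print(f"  N={N:>9,}: {hyper_coef(k, N, n, K):.6f}  (binomial {binom_coef(k, n, p):.6f})")

lam = 4.0
print("part (b): n grows while np = 4")
for n_big in (100, 10_000, 1_000_000):
    print(f"  n={n_big:>9,}: {binom_coef(k, n_big, lam / n_big):.6f}  (Poisson {poisson_coef(k, lam):.6f})")
```

This is a numerical illustration, not a proof; as the exercise notes, a rigorous demonstration requires some calculus.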
6. MODEL UNIT EXAM 

In what sense is the binomial approximation to 
hypergeometric probabilities structural? In what sense is 
the normal approximation to binomial probabilities 
structural? 

You are working for an automobile dealer. Invent a 
hypergeometric random variable related to your work, 
and describe what N, K, n, and k are. Can you 
approximate its distribution adequately with a binomial 
distribution? How would you change your answer to the 
first question to make the random variable genuinely 
binomial? What would p be? 

Thomas Tolloller plays a gambling game at which he 
has probability p = .492 of winning $1 and probability 
1 - p = .508 of losing $1. What is the probability that, 
after 100 plays, he has won more than he has lost? 
What is the probability that, after 100 plays, he has 
won exactly as many times as he has lost? 

Thomas Tolloller plays another game at which he is 
told he has a 1/38 chance of winning on each play. 
After 100 plays, he has won only once. How likely is 
winning no more than once in 100 plays if the game is 
as described? 
7. ANSWERS TO MODEL UNIT EXAM 

The sampling schemes in binomial and hypergeometric situations 
are similar. The binomial and normal distributions are both 
sampling distributions of sums, and they can be shown 
mathematically to be similar for large sample sizes. 

For example, Y could be the number of people in a random sample 
of 15 of this year's customers who bought Model PQR (the random 
sample is chosen without replacement). N would be the total 
number of this year's customers; K would be the number of this year's 
customers who bought Model PQR; n would be 15, the number of 
customers in the sample; and k would be the number of customers in 
the sample who bought Model PQR. If the dealership is active 
this year (selling more than 75 cars, say), then the binomial 
approximation should be adequate. To make Y genuinely binomial, 
the sample should be chosen with replacement (i.e., a customer 
could appear in the sample more than once). p = K/N. 

$$B(49;100,.508) \approx \Phi\!\left(\frac{49 - 100(.508) + .5}{\sqrt{100(.508)(.492)}}\right) = \Phi(-.26) = .397$$

and 

$$b(50;100,.492) \approx \frac{1}{\sqrt{100(.492)(.508)}}\;\phi\!\left(\frac{50 - 100(.492)}{\sqrt{100(.492)(.508)}}\right) = \frac{\phi(.160)}{4.999} = .079$$

$$B(1;100,1/38) = b(0;100,1/38) + b(1;100,1/38) \approx p(0;100/38) + p(1;100/38) = .261.$$
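
The three numerical answers above can be reproduced as follows. This is a sketch rather than part of the original answer key: the variable and function names are illustrative, and the normal distribution and density functions are computed directly instead of being read from tables.

```python
from math import erf, exp, pi, sqrt

def normal_cdf(z):
    """Standard normal distribution function (Phi)."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def normal_pdf(z):
    """Standard normal density (phi)."""
    return exp(-0.5 * z * z) / sqrt(2.0 * pi)

plays = 100
sd = sqrt(plays * 0.492 * 0.508)   # about 4.999

# Winning more than losing means at most 49 losses, with P(loss) = .508.
more_won = normal_cdf((49 + 0.5 - plays * 0.508) / sd)

# Exactly 50 wins: normal density approximation to a single binomial term.
tie = normal_pdf((50 - plays * 0.492) / sd) / sd

# At most one win when P(win) = 1/38: Poisson approximation with mean 100/38.
lam = plays / 38
at_most_one = exp(-lam) * (1 + lam)

print(f"P(won more than lost) = {more_won:.3f}")
print(f"P(exactly 50 wins)    = {tie:.3f}")
print(f"P(at most one win)    = {at_most_one:.3f}")
```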
STUDENT FORM 1 
Request for Help 

Return to: 
EDC/UMAP 
55 Chapel St. 
Newton, MA 02160 

Student: If you have trouble with a specific part of this unit, please fill 
out this form and take it to your instructor for assistance. The information 
you give will help the author to revise the unit. 

Your Name ____________________          Unit No. ________ 

Page ________  ( ) Upper  ( ) Middle  ( ) Lower 
OR 
Section ________  Paragraph ________ 
OR 
Model Exam Problem No. ________ 
Text Problem No. ________ 

Description of Difficulty: (Please be specific) 



Instructor: Please indicate your resolution of the difficulty in this box. 

( ) Corrected errors in materials. List corrections here: 



( ) Gave student better explanation, example, or procedure than in unit. 
    Give brief outline of your addition here: 



( ) Assisted student in acquiring general learning and problem-solving 
    skills (not using examples from this unit.) 

Instructor's Signature ____________________ 

Please use reverse if necessary. 



STUDENT FORM 2 
Unit Questionnaire 

Return to: 
EDC/UMAP 
55 Chapel St. 
Newton, MA 02160 

Name ____________________  Unit No. ________  Date ________ 

Institution ____________________  Course No. ________ 

Check the choice for each question that comes closest to your personal opinion. 

1. How useful was the amount of detail in the unit? 

   ___ Not enough detail to understand the unit 
   ___ Unit would have been clearer with more detail 
   ___ Appropriate amount of detail 
   ___ Unit was occasionally too detailed, but this was not distracting 
   ___ Too much detail; I was often distracted 

2. How helpful were the problem answers? 

   ___ Sample solutions were too brief; I could not do the intermediate steps 
   ___ Sufficient information was given to solve the problems 
   ___ Sample solutions were too detailed; I didn't need them 

3. Except for fulfilling the prerequisites, how much did you use other sources (for 
   example, instructor, friends, or other books) in order to understand the unit? 

   ___ A Lot    ___ Somewhat    ___ A Little    ___ Not at all 

4. How long was this unit in comparison to the amount of time you generally spend on 
   a lesson (lecture and homework assignment) in a typical math or science course? 

   ___ Much Longer   ___ Somewhat Longer   ___ About the Same   ___ Somewhat Shorter   ___ Much Shorter 

5. Were any of the following parts of the unit confusing or distracting? (Check 
   as many as apply.) 

   ___ Prerequisites 
   ___ Statement of skills and concepts (objectives) 
   ___ Paragraph headings 
   ___ Examples 
   ___ Special Assistance Supplement (if present) 
   ___ Other, please explain ____________________ 

6. Were any of the following parts of the unit particularly helpful? (Check as many 
   as apply.) 

   ___ Prerequisites 
   ___ Statement of skills and concepts (objectives) 
   ___ Examples 
   ___ Problems 
   ___ Paragraph headings 
   ___ Table of Contents 
   ___ Special Assistance Supplement (if present) 
   ___ Other, please explain ____________________ 

Please describe anything in the unit that you did not particularly like. 



Please describe anything that you found particularly helpful. (Please use the back of 
this sheet if you need more space.) 